We would like to inform you of a known issue in the communication library UCX version 1.16.0 (included in Mellanox OFED 23.10-3.2.2.0-LTS), which is currently installed on SQUID. This issue causes certain programs, including OpenFOAM, to terminate abnormally. Details are provided on this page.
[Issue Details]
A floating-point exception occurs within the MPI_Init_thread function, leading to a forced termination of the program with the following message:
Caught signal 8 (Floating point exception: floating-point invalid operation)
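To check whether earlier jobs were hit by this issue, you can search their output files for the message above. The following is a minimal sketch, assuming your job output is stored as plain-text files under a single directory; the function name find_fpe_logs is hypothetical and is not a tool provided on SQUID.

```shell
# List files under a directory (default: current directory) that contain
# the signal 8 message quoted in this notice.
# -r: search recursively, -l: print file names only, -F: match a fixed string.
find_fpe_logs() {
    grep -rlF "Caught signal 8 (Floating point exception" -- "${1:-.}"
}
```

For example, find_fpe_logs "$HOME/my_job_dir" prints the path of every file in that (hypothetical) directory that contains the message, and prints nothing if none do.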
[Affected Programs]
Programs built using Intel compilers (e.g., icc, ifort) with the -fpe0 option are affected. We have confirmed that the pre-installed version of OpenFOAM on SQUID is among them.
[Time of Occurrence]
This issue has been occurring since the end-of-year maintenance in FY2024, after the upgrade to UCX 1.16.0.
[Workaround]
To avoid the issue, please set the following environment variable in your job script or environment configuration file:
export UCX_PROTO_ENABLE=n
We have confirmed that the floating-point exception occurs in the automatic communication protocol selection introduced as part of UCX Protocols v2, which is enabled by default starting from UCX 1.16.0. Setting the above environment variable forces UCX to fall back to the legacy UCX Protocols v1, which avoids the floating-point exception.
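The workaround above can be placed in a job script as in the following minimal sketch. It assumes a bash job script; the scheduler directives and the actual launch command depend on your job, so they are omitted, and the commented launch line with the program name my_solver is purely hypothetical. The ucx_info check is optional and is skipped when the command is not on PATH.

```shell
#!/bin/bash
# Workaround from this notice: disable UCX Protocols v2 so that UCX
# falls back to the legacy Protocols v1.
export UCX_PROTO_ENABLE=n

# Optional: print the UCX version in use, if ucx_info is available.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -v
fi

# Record the setting in the job output for later reference.
echo "UCX_PROTO_ENABLE=${UCX_PROTO_ENABLE}"

# Hypothetical launch line; replace with your own MPI launch command.
# mpirun ./my_solver
```

The export must appear before the MPI program is launched, because the variable is read when the program initializes UCX.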
[Future Measures]
We are planning to upgrade to a version of UCX where this issue has been fixed. We will notify users once the upgrade schedule is determined.
We kindly ask that all users apply the above workaround to avoid the impact of this issue. Thank you for your understanding and cooperation.
Posted: May 28, 2025