UCX nonblocking accumulate leads to program freeze #13064
Have you tried Open MPI 5.0.6 and the latest UCX?
I had some trouble getting OpenMPI's configure to find a freshly built UCX, so I used Spack to build OpenMPI 5.0.6 against UCX 1.17.0 and 1.18.0 (after adding the new release to the Spack package). The problem occurs just the same as with OpenMPI 5.0.2 + UCX 1.16.0.
Edit: I got around to testing UCX 1.18.0 (provided by the Arch packagers) + OpenMPI 5.0.6 (self-built) on my local machine. There the freeze doesn't occur, but the hardware and configure flags differ considerably. I am currently cycling through the different configure flags to see whether any of them make a difference on the cluster.
I accidentally used the wrong executable on the cluster, which was still linked against UCX 1.16.0. When actually using UCX 1.18.0 + OpenMPI 5.0.6, I now get an error as follows, regardless of whether I pass the nonblocking MCA parameter. Passing it does change the backtrace a bit, since different functions get called under the hood. Backtrace snippets are below; the full log of one run is attached at the end. It feels more like a weird hardware/UCX interaction by now.
Blocking accumulate:
Nonblocking accumulate:
Full log for nonblocking accumulate:
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
mpirun --version
mpirun (Open MPI) 5.0.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Provided by the cluster operator; according to ompi_info, it was compiled from source.
Please describe the system on which you are running
Details of the problem
I have an application that uses a lot of one-sided communication with small request sizes. When I went to scale it up, performance dropped sharply when going to two nodes, and I found that the one-sided calls were responsible. It's mainly MPI_Accumulate (20 bytes per call) and MPI_Compare_and_swap (4 bytes per call) that cause this. I looked over potential MCA parameters that could give me a speedup without code changes, and so tried passing
--mca osc_ucx_enable_nonblocking_accumulate true
to mpirun, since ~12us per MPI_Accumulate according to my profiler (nsys) sounded like a blocking round-trip to me, with MPI_Send latency at 2.5us and MPI_Put/Get at 5us. Adding that MCA parameter made my application freeze, so I broke it down to a self-contained example to figure out the conditions for the freezing.
Reproducer code is attached; a high-level description follows:
rma_freeze.c.txt
I allocate a window with MPI_Win_allocate, lock it in a shared manner for all processes, then fill it with MPI_Put + flush. Afterwards, I call MPI_Accumulate in a loop to increment the non-locally owned part. This goes fine until some maximum amount of data has been put through MPI_Accumulate. Between iterations I call MPI_Win_flush_all, which should ensure that all operations have completed before the next iteration starts.
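A minimal sketch of that communication pattern (not the attached reproducer verbatim; the use of MPI_Win_lock_all for the shared lock, MPI_SUM as the accumulate operation, and the constants are simplifications of what rma_freeze.c does):

```c
/* Sketch of the reproducer's pattern: MPI_Win_allocate + shared lock,
 * one MPI_Put + flush to initialize, then MPI_Accumulate in a loop with
 * MPI_Win_flush_all between iterations. Constants and the MPI_SUM op are
 * assumptions for illustration; see rma_freeze.c.txt for the real code. */
#include <mpi.h>
#include <stdio.h>

enum { ELEMS_PER_ITER = 512, ITERS = 65 };    /* hangs at iteration 65 for me */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int *base;
    MPI_Win win;
    MPI_Win_allocate(ELEMS_PER_ITER * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    MPI_Win_lock_all(0, win);                 /* shared lock on every process */

    /* Initialize the remote window once with MPI_Put + flush. */
    int target = (rank + 1) % nranks;
    int zeros[ELEMS_PER_ITER] = {0};
    MPI_Put(zeros, ELEMS_PER_ITER, MPI_INT, target,
            0, ELEMS_PER_ITER, MPI_INT, win);
    MPI_Win_flush(target, win);

    int ones[ELEMS_PER_ITER];
    for (int i = 0; i < ELEMS_PER_ITER; i++) ones[i] = 1;

    for (int it = 1; it <= ITERS; it++) {
        /* Increment the non-locally owned part of the window. */
        MPI_Accumulate(ones, ELEMS_PER_ITER, MPI_INT, target,
                       0, ELEMS_PER_ITER, MPI_INT, MPI_SUM, win);
        /* All outstanding operations should be complete after this call. */
        MPI_Win_flush_all(win);
        if (rank == 0) printf("iteration %d done\n", it);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```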
When I accumulate 512 ints per iteration, the reproducer hangs at the 65th iteration, suggesting that 2^15 elements or 2^17 bytes is a magic number somewhere. Accumulating 2^14 ints per iteration gives a freeze at the 3rd iteration, pointing to the same number. If I only do one int per iteration, I can get slightly beyond 2^15 elements (32801).
This would usually suggest a limit on the number of outstanding nonblocking calls, with smaller requests being served quickly enough. But since MPI_Win_flush_all is called between iterations, there should be no outstanding operations at that point. Printing the target values also suggests the operations have completed between iterations.
Without osc_ucx_enable_nonblocking_accumulate, any amount of data put through MPI_Accumulate works. Keeping the loop below the critical value is safe, but not useful for my application, and probably indicates a bug somewhere. I don't know whether this hints that the bug lies more in OpenMPI or in UCX, but I also tried compiling OpenMPI 5.0.6 (source tarball from the OpenMPI site) while linking against the provided system libraries (except for the internal hwloc). The freezing problem occurs just the same in both versions.
Compilation + run example:
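Roughly, on a single node (compiler wrapper, flags, and rank count are illustrative; the cluster jobfile below is what I actually use):

```sh
# Illustrative compile + run; the cluster jobfile differs slightly.
mpicc -O2 rma_freeze.c -o rma_freeze

# With the nonblocking accumulate path enabled (the problematic case):
mpirun -n 2 --mca osc_ucx_enable_nonblocking_accumulate true ./rma_freeze

# Without the MCA parameter, any amount of data works:
mpirun -n 2 ./rma_freeze
```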
I run the reproducer on the cluster via the attached jobfile, which has a slightly different call since that's required to get it running on multiple nodes.
rmatest.sh.txt
Logs for the 512x64 (runs) and 512x65 (freezes) cases:
rmalog_512_65.txt
rmalog_512_64.txt