UCX nonblocking accumulate leads to program freeze #13064

Open
mts42 opened this issue Jan 28, 2025 · 3 comments

@mts42

mts42 commented Jan 28, 2025

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

mpirun --version
mpirun (Open MPI) 5.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Provided by the cluster operator; according to ompi_info it was compiled from source.

Please describe the system on which you are running

  • Operating system/version: RHEL 9.3, uname -r: 5.14.0-362.18.1.el9_3.x86_64
  • Computer hardware: AMD EPYC 9654, NVIDIA H100 GPUs (likely not relevant here)
  • Network type: InfiniBand NDR200 200 Gbps x 4 (the issue also occurs on a single node when UCX is forced for osc)
  • UCX version just in case: 1.16.0

Details of the problem

I have an application that uses a lot of one-sided communication with small request sizes. When I tried to scale it up, performance dropped sharply going from one to two nodes, and I traced the drop to the one-sided calls, mainly MPI_Accumulate (20 bytes per call) and MPI_Compare_and_swap (4 bytes per call). I looked over MCA parameters that might give me a speedup without code changes and tried passing

--mca osc_ucx_enable_nonblocking_accumulate true

to mpirun, since ~12us per MPI_Accumulate according to my profiler (nsys) sounded like a blocking round-trip to me, with MPI_Send latency at 2.5us and MPI_Put/Get at 5us. Adding that MCA parameter made my application freeze, so I broke it down to a self-contained example to figure out the conditions for the freezing.

Reproducer code is attached; a high-level description follows:
rma_freeze.c.txt

I allocate a window with MPI_Win_allocate, lock it in a shared manner for all processes, then fill it with MPI_Put + flush. Afterwards, I call MPI_Accumulate in a loop to increment the non-locally owned part. This goes fine until some maximum amount of data has been put through MPI_Accumulate. Between iterations I call MPI_Win_flush_all, which should ensure that all outstanding operations have completed.
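A minimal sketch of that structure (the attached rma_freeze.c.txt is the authoritative version; names, argument handling, and the barrier after the initial fill are illustrative):

/* Sketch of the reproducer described above; rma_freeze.c.txt is authoritative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int accs_per_iter = (argc > 1) ? atoi(argv[1]) : 512; /* MPI_Accumulate calls per flush */
    int niter         = (argc > 2) ? atoi(argv[2]) : 64;  /* number of iterations */

    int *base;
    MPI_Win win;
    MPI_Win_allocate(accs_per_iter * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    /* Shared lock on all processes for the lifetime of the test. */
    MPI_Win_lock_all(0, win);

    int target = (rank + 1) % size;          /* operate on the non-locally owned part */
    int *vals = malloc(accs_per_iter * sizeof(int));
    for (int i = 0; i < accs_per_iter; i++) vals[i] = 1;

    /* Fill the remote window once with MPI_Put + flush. */
    MPI_Put(vals, accs_per_iter, MPI_INT, target, 0, accs_per_iter, MPI_INT, win);
    MPI_Win_flush(target, win);
    MPI_Barrier(MPI_COMM_WORLD);

    for (int it = 0; it < niter; it++) {
        for (int i = 0; i < accs_per_iter; i++) {
            /* one int per call: increment element i on the target rank */
            MPI_Accumulate(&vals[i], 1, MPI_INT, target,
                           i, 1, MPI_INT, MPI_SUM, win);
        }
        /* should complete all outstanding operations before the next iteration */
        MPI_Win_flush_all(win);
        if (rank == 0) printf("iteration %d done\n", it + 1);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    free(vals);
    MPI_Finalize();
    return 0;
}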

When I accumulate 512 ints per iteration, the reproducer hangs at the 65th iteration, suggesting 2^15 elements or 2^17 bytes as a magic number somewhere. Accumulating 2^14 ints per iteration gives a freeze at the 3rd iteration, pointing to the same number. If I only do one int per iteration, I can get slightly beyond 2^15 elements (32801).

This would usually suggest a limit on the number of outstanding nonblocking calls, with smaller requests being served quickly enough. But since MPI_Win_flush_all is called between iterations, there should be no outstanding operations at that point. Printing the target values also suggests that the operations have completed between iterations.

Without osc_ucx_enable_nonblocking_accumulate, any amount of data put through MPI_Accumulate works. Keeping the total below the critical value is safe, but that is not useful for my application and probably indicates a bug somewhere.

I don't know whether this hints that the bug lies more in OpenMPI or in UCX, but I also tried compiling OpenMPI 5.0.6 (source tarball from the OpenMPI site) linked against the system-provided libraries (except for the internal hwloc). The freezing problem occurs just the same in both versions.

Compilation + run example:

mpicc rma_freeze.c -o rma_freeze
# first argument is the number of MPI_Accumulate calls between window flushes, second is the number of iterations
# so this one works fine since 33*512 < 2^15
mpirun --mca osc ucx --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 --mca btl_base_verbose 100 --mca osc_base_verbose 100 --mca osc_rdma_verbose 100  --mca osc_ucx_enable_nonblocking_accumulate true -np 2 rma_freeze 512 33
# this one freezes since 70*512 > 2^15
mpirun --mca osc ucx --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 --mca btl_base_verbose 100 --mca osc_base_verbose 100 --mca osc_rdma_verbose 100  --mca osc_ucx_enable_nonblocking_accumulate true -np 2 rma_freeze 512 70

I run the reproducer on the cluster via the attached jobfile, which has a slightly different call since that's required to get it running on multiple nodes.
rmatest.sh.txt

Logs for the 512x64 (runs) and 512x65 (freezes) cases:

rmalog_512_65.txt
rmalog_512_64.txt

@jsquyres
Member

Have you tried Open MPI 5.0.6 and the latest UCX?

@mts42
Author

mts42 commented Jan 30, 2025

I had some trouble getting OpenMPI's configure to find a freshly built UCX, so I used Spack to build OpenMPI 5.0.6 against UCX 1.17.0 and 1.18.0 (after adding the new release to Spack). The problem occurs just the same as with OpenMPI 5.0.2 + UCX 1.16.0.
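For reference, the Spack build can be expressed with a spec along these lines (illustrative; the exact variants and compiler I used may differ):

# build Open MPI 5.0.6 with UCX support against a specific UCX release
spack install openmpi@5.0.6 fabrics=ucx ^ucx@1.18.0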

Edit: I got around to testing UCX 1.18.0 (provided by the Arch packagers) + OpenMPI 5.0.6 (self-built) on my local machine. The freeze does not occur there, but there are large differences in hardware and configure flags. I am currently cycling through the different configure flags to see whether any of them makes a difference on the cluster.

@mts42
Author

mts42 commented Jan 30, 2025

I accidentally used the wrong executable on the cluster, which was still linked against UCX 1.16.0. When actually using UCX 1.18.0 + OpenMPI 5.0.6 I now get an error as shown below, regardless of whether the nonblocking MCA parameter is passed; it does change the backtrace a bit, since different functions get called under the hood. Backtrace snippets follow; the full log of one run is attached at the end. By now this feels more like a weird hardware/UCX interaction.

blocking accumulate:

<snip spam>
[r10n10:2279697] select: component ucx selected
[r10n10:2279697:0:2279697] rc_verbs_iface.c:124  send completion with error: remote access error [qpn 0x32e7 wrid 0xd vendor_err 0x88]
[r10n10:2279697:0:2279697] rc_verbs_iface.c:124  [rqpn 0x59c0 dlid=1011 sl=0 port=1 src_path_bits=0]
==== backtrace (tid:2279697) ====
 0  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_handle_error+0x294) [0x14ca422fe464]
 1  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_fatal_error_message+0xba) [0x14ca422fb61a]
 2  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_log_default_handler+0xef2) [0x14ca42300032]
 3  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_log_dispatch+0xd1) [0x14ca42300401]
 4  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/ucx/libuct_ib.so.0(+0x1fee1) [0x14ca3d309ee1]
 5  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/ucx/libuct_ib.so.0(+0x21309) [0x14ca3d30b309]
 6  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucp.so.0(ucp_worker_progress+0x5a) [0x14ca4238608a]
 7  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucp.so.0(ucp_worker_fence+0x26f) [0x14ca423a70ef]
 8  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libopen-pal.so.80(opal_common_ucx_wpmem_fence+0x52) [0x14ca424efa72]
 9  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libmpi.so.40(+0x2357ef) [0x14ca42e237ef]
10  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libmpi.so.40(ompi_osc_ucx_accumulate+0x17) [0x14ca42e27f47]
11  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libmpi.so.40(PMPI_Accumulate+0xd8) [0x14ca42c92b18]
12  rma_freeze_spackompi() [0x401791]
13  /lib64/libc.so.6(+0x3feb0) [0x14ca42a13eb0]
14  /lib64/libc.so.6(__libc_start_main+0x80) [0x14ca42a13f60]
15  rma_freeze_spackompi() [0x401165]

nonblocking accumulate:

<snip spam>
[r19n10:1099669] select: component ucx selected
[r22n7:2563887:0:2563887] rc_verbs_iface.c:124  send completion with error: Work Request Flushed Error [qpn 0xe5eb wrid 0x17 vendor_err 0xf4]
[r22n7:2563887:0:2563887] rc_verbs_iface.c:124  [rqpn 0x150cb dlid=1009 sl=0 port=1 src_path_bits=0]
[r19n10:1099669:0:1099669] rc_verbs_iface.c:124  send completion with error: remote access error [qpn 0x150cb wrid 0x15 vendor_err 0x88]
[r19n10:1099669:0:1099669] rc_verbs_iface.c:124  [rqpn 0xe5eb dlid=963 sl=0 port=1 src_path_bits=0]
==== backtrace (tid:2563887) ====
 0  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_handle_error+0x294) [0x14ebda2a2464]
 1  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_fatal_error_message+0xba) [0x14ebda29f61a]
 2  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_log_default_handler+0xef2) [0x14ebda2a4032]
 3  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_log_dispatch+0xd1) [0x14ebda2a4401]
 4  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/ucx/libuct_ib.so.0(+0x1fee1) [0x14ebd52adee1]
==== backtrace (tid:1099669) ====
 0  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_handle_error+0x294) [0x14c84eb0a464]
 1  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_fatal_error_message+0xba) [0x14c84eb0761a]
 2  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_log_default_handler+0xef2) [0x14c84eb0c032]
 3  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucs.so.0(ucs_log_dispatch+0xd1) [0x14c84eb0c401]
 4  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/ucx/libuct_ib.so.0(+0x1fee1) [0x14c84c113ee1]
 5  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/ucx/libuct_ib.so.0(+0x21309) [0x14c84c115309]
 6  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucp.so.0(ucp_worker_progress+0x5a) [0x14c84eb9208a]
 7  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucp.so.0(ucp_worker_fence+0x26f) [0x14c84ebb30ef]
 8  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libopen-pal.so.80(opal_common_ucx_wpmem_fence+0x52) [0x14c84ecfba72]
 9  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libmpi.so.40(ompi_osc_ucx_req_completion+0x277) [0x14c84f633067]
10  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libopen-pal.so.80(opal_common_ucx_req_completion+0x12) [0x14c84ecf9772]
11  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucp.so.0(+0x6f75b) [0x14c84ebb175b]
12  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucp.so.0(+0x7057f) [0x14c84ebb257f]
13  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/ucx-1.18.0-2tcmd53og7l2k6fs2n2ujgslyicswo3h/lib/libucp.so.0(ucp_worker_progress+0x5a) [0x14c84eb9208a]
14  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libmpi.so.40(+0x2321ca) [0x14c84f62c1ca]
15  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libmpi.so.40(+0x238043) [0x14c84f632043]
16  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libmpi.so.40(ompi_osc_ucx_accumulate_nb+0x24) [0x14c84f633f74]
17  /home/5/uq04775/workspace/spack/opt/spack/linux-rhel9-zen3/gcc-11.4.1/openmpi-5.0.6-5o5ov3j5rgawco3zri7lo3yjotjlrarv/lib/libmpi.so.40(PMPI_Accumulate+0xd8) [0x14c84f49eb18]
18  rma_freeze_spackompi() [0x401791]
19  /lib64/libc.so.6(+0x3feb0) [0x14c84f21feb0]
20  /lib64/libc.so.6(__libc_start_main+0x80) [0x14c84f21ff60]
21  rma_freeze_spackompi() [0x401165]

Full log for nonblocking accumulate:

ompi506_ucx118nb_rmalog_512_64.txt
