
Segfault when using ncclCommSplit with multiple threads and non-blocking init #1605

Open

lw opened this issue Feb 14, 2025 · 8 comments

@lw

lw commented Feb 14, 2025

I have 4 GPUs (2 nodes with 2 GPUs each). I'm initializing a global communicator, doing a "barrier" (an all-reduce on 4 bytes followed by a CUDA device sync), splitting a cross-node "network rail" communicator (ranks 0<->2 and 1<->3), doing another barrier, splitting an intra-node communicator (ranks 0<->1 and 2<->3), and doing another barrier. However, when I do an all-gather shortly after, I encounter a segfault in ncclGroupCommJoin.

I'm using NCCL via PyTorch. I initially encountered this issue on v2.21.5 (shipped with PyTorch), but I can also repro it on v2.25.1 (coming from nvidia-nccl-cu12 on PyPI). The problem only seems to occur when I use PyTorch's new "bound device id" feature, which initializes NCCL eagerly and in non-blocking mode, and allows PyTorch to use ncclCommSplit to create subgroups.

I noticed that the segfault happens during the backward pass, which PyTorch executes on a separate autograd thread. This likely means that all the all-gathers during the forward pass (on the main thread) succeeded.
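
To make the sequence concrete, here is a simplified PyTorch sketch of what my job does (illustrative, not my exact code: the launcher, the subgroup bookkeeping, and the tensor sizes are assumptions, but the call order matches the description above):

# 4 ranks, 2 nodes x 2 GPUs, launched e.g. with torchrun.
import os
import torch
import torch.distributed as dist

def barrier(group=None):
    # my "barrier": an all-reduce on 4 bytes followed by a CUDA device sync
    t = torch.zeros(1, dtype=torch.int32, device="cuda")
    dist.all_reduce(t, group=group)
    torch.cuda.synchronize()

rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# "bound device id": PyTorch initializes NCCL eagerly in non-blocking mode and
# uses ncclCommSplit when creating subgroups.
dist.init_process_group("nccl", device_id=torch.device(f"cuda:{local_rank}"))
barrier()

# Cross-node "network rail" groups: {0, 2} and {1, 3}.
rails = [dist.new_group([0, 2]), dist.new_group([1, 3])]
rail = rails[rank % 2]
barrier(rail)

# Intra-node groups: {0, 1} and {2, 3}.
intras = [dist.new_group([0, 1]), dist.new_group([2, 3])]
intra = intras[rank // 2]
barrier(intra)

# The all-gather that segfaults runs on the rail group; in the real job it is
# issued from the autograd thread (a tensor pre-hook during the backward pass).
inp = torch.empty(54530048, dtype=torch.bfloat16, device="cuda")
out = torch.empty(2 * inp.numel(), dtype=torch.bfloat16, device="cuda")
dist.all_gather_into_tensor(out, inp, group=rail)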

Here is the information I could gather with GDB (let me know if I can get you more details, I can repro this reliably):

(gdb) bt
#0  0x00007bd19085bb28 in ncclGroupCommJoin (comm=<optimized out>) at include/group.h:113
#1  taskAppend (info=0x7bcc20b30270, comm=0x611edb75d650) at enqueue.cc:2152
#2  ncclEnqueueCheck (info=info@entry=0x7bcc20b30270) at enqueue.cc:2224
#3  0x00007bd19084e991 in ncclAllGather (sendbuff=0x7bcbc4a00000, recvbuff=0x7bcbc4a00000, sendcount=54530048, datatype=ncclBfloat16, comm=0x611edb75d650, stream=<optimized out>) at collectives.cc:88
#4  0x00007bd201a27322 in c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so
#5  0x00007bd23aecb068 in c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#6  0x00007bd23aedaaac in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) ()
   from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#7  0x00007bd23a5a36e0 in torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007bd23aef3472 in c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007bd24a3998d1 in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
   from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so
#10 0x00007bd249aae1dd in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so
#11 0x0000611ecacbf588 in cfunction_call (func=0x7bd18eb0e0c0, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.8/Objects/methodobject.c:537
#12 0x0000611ecac9f75b in _PyObject_MakeTpCall (tstate=0x7bcb60013bb0, callable=0x7bd18eb0e0c0, args=0x7bcf24060618, nargs=<optimized out>, keywords=0x0) at /usr/local/src/conda/python-3.12.8/Objects/call.c:240
#13 0x0000611ecabad6a1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7bcf24060590, throwflag=<optimized out>) at Python/bytecodes.c:2715
#14 0x0000611ecacec542 in _PyEval_EvalFrame (throwflag=0, frame=0x7bcf24060020, tstate=0x7bcb60013bb0) at /usr/local/src/conda/python-3.12.8/Include/internal/pycore_ceval.h:89
#15 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x7bcc20b311f0, locals=0x0, func=0x7bd17be00680, tstate=0x7bcb60013bb0) at /usr/local/src/conda/python-3.12.8/Python/ceval.c:1683
#16 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7bcc20b311f0, func=0x7bd17be00680) at /usr/local/src/conda/python-3.12.8/Objects/call.c:419
#17 _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7bcc20b311f0, callable=0x7bd17be00680, tstate=0x7bcb60013bb0) at /usr/local/src/conda/python-3.12.8/Include/internal/pycore_call.h:92
#18 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.8/Objects/classobject.c:91
#19 0x00007bd249e19c49 in torch::autograd::(anonymous namespace)::_call_hooks(_object*, _object*) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so
#20 0x00007bd249e1ab90 in torch::autograd::PyFunctionTensorPreHook::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> > const&) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so
#21 0x00007bd23a5aed27 in torch::autograd::call_tensor_pre_hooks(torch::autograd::Node&, std::vector<at::Tensor, std::allocator<at::Tensor> >) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007bd23a5b589d in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007bd23a5b735f in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007bd23a5af5b1 in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007bd249e04461 in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /my/conda/env/lib/python3.12/site-packages/torch/lib/libtorch_python.so
#26 0x00007bd251cf1b65 in std::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
#27 0x00007bd256c66ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#28 0x00007bd256cf8850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) p comm
$1 = <optimized out>
(gdb) p comm->groupNext
value has been optimized out
(gdb) p pp
$2 = (ncclComm **) 0x611edb8608c0
(gdb) p *pp
$3 = (ncclComm *) 0x1
(gdb) p comm->intraComm0
value has been optimized out
(gdb) p (*pp)->intraComm0
Cannot access memory at address 0x78ba1
(gdb) p (*pp)->groupNext
Cannot access memory at address 0x78d01
(gdb) p &(*pp)->groupNext
$4 = (ncclComm **) 0x78d01

The segfault thus likely occurs when accessing (*pp)->intraComm0.

Here is the output of NCCL_DEBUG=INFO on rank 0 (note that all ranks hit the segfault at the exact same time):

myhost:101952:101952 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
myhost:101952:101952 [0] NCCL INFO Bootstrap: Using eth0:10.0.214.75<0>
myhost:101952:101952 [0] NCCL INFO cudaDriverVersion 12040
myhost:101952:101952 [0] NCCL INFO NCCL version 2.25.1+cuda12.2
myhost:101952:101952 [0] NCCL INFO Comm config Blocking set to 0
myhost:101952:102026 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.
myhost:101952:102026 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8)
myhost:101952:102026 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
myhost:101952:102026 [0] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8)
myhost:101952:102026 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
myhost:101952:102026 [0] NCCL INFO P2P plugin v8 IBext_v8
myhost:101952:102026 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
myhost:101952:102026 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
myhost:101952:102026 [0] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [4]ibp4:1/IB/SHARP [5]ibp5:1/IB/SHARP [6]ibp6:1/IB/SHARP [7]ibp7:1/IB/SHARP [RO]; OOB eth0:10.0.214.75<0>
myhost:101952:102026 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
myhost:101952:102026 [0] NCCL INFO Using network IBext_v8
myhost:101952:102026 [0] NCCL INFO ncclCommInitRankConfig comm 0x611eda635410 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1a000 commId 0xbce14ccdf922c960 - Init START
myhost:101952:102026 [0] NCCL INFO RAS client listening socket at ::1<28028>
myhost:101952:102026 [0] NCCL INFO Bootstrap timings total 0.020716 (create 0.000029, send 0.000073, recv 0.000260, ring 0.000164, delay 0.000000)
myhost:101952:102026 [0] NCCL INFO MNNVL busId 0x1a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
myhost:101952:102026 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555,55555555,55555555
myhost:101952:102026 [0] NCCL INFO comm 0x611eda635410 rank 0 nRanks 4 nNodes 2 localRanks 2 localRank 0 MNNVL 0
myhost:101952:102026 [0] NCCL INFO Channel 00/08 : 0 1 2 3
myhost:101952:102026 [0] NCCL INFO Channel 01/08 : 0 3 2 1
myhost:101952:102026 [0] NCCL INFO Channel 02/08 : 0 1 2 3
myhost:101952:102026 [0] NCCL INFO Channel 03/08 : 0 3 2 1
myhost:101952:102026 [0] NCCL INFO Channel 04/08 : 0 1 2 3
myhost:101952:102026 [0] NCCL INFO Channel 05/08 : 0 3 2 1
myhost:101952:102026 [0] NCCL INFO Channel 06/08 : 0 1 2 3
myhost:101952:102026 [0] NCCL INFO Channel 07/08 : 0 3 2 1
myhost:101952:102026 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/2/-1->0->-1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->2 [5] -1/-1/-1->0->1 [6] 1/-1/-1->0->2 [7] -1/-1/-1->0->1
myhost:101952:102026 [0] NCCL INFO P2P Chunksize set to 131072
myhost:101952:102026 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 1 directMode 0
myhost:101952:102045 [0] NCCL INFO [Proxy Service] Device 0 CPU core 14
myhost:101952:102047 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 16
myhost:101952:102026 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myhost:101952:102026 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
myhost:101952:102026 [0] NCCL INFO CC Off, workFifoBytes 1048576
myhost:101952:102026 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
myhost:101952:102026 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
myhost:101952:102026 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
myhost:101952:102026 [0] NCCL INFO ncclCommInitRankConfig comm 0x611eda635410 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1a000 commId 0xbce14ccdf922c960 - Init COMPLETE
myhost:101952:102026 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 4 total 1.34 (kernels 0.23, alloc 0.92, bootstrap 0.02, allgathers 0.01, topo 0.14, graphs 0.02, connections 0.00, rest 0.00)
myhost:101952:102053 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 24
myhost:101952:102050 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:101952:102050 [0] NCCL INFO Channel 02/0 : 3[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:101952:102050 [0] NCCL INFO Channel 04/0 : 3[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:101952:102050 [0] NCCL INFO Channel 06/0 : 3[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:101952:102050 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102050 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102050 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102050 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102050 [0] NCCL INFO Channel 01/0 : 0[0] -> 3[1] [send] via NET/IBext_v8/1(1)/GDRDMA
myhost:101952:102050 [0] NCCL INFO Channel 03/0 : 0[0] -> 3[1] [send] via NET/IBext_v8/1(1)/GDRDMA
myhost:101952:102050 [0] NCCL INFO Channel 05/0 : 0[0] -> 3[1] [send] via NET/IBext_v8/1(1)/GDRDMA
myhost:101952:102050 [0] NCCL INFO Channel 07/0 : 0[0] -> 3[1] [send] via NET/IBext_v8/1(1)/GDRDMA
myhost:101952:102045 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22.
myhost:101952:102050 [0] NCCL INFO Connected all rings, use ring PXN 1 GDR 1
myhost:101952:101952 [0] NCCL INFO Comm config Blocking set to 0
myhost:101952:102062 [0] NCCL INFO Using network IBext_v8
myhost:101952:102062 [0] NCCL INFO ncclCommSplit comm 0x611edb75d650 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 parent 0x611eda635410 splitCount 1 color 1684518034 key 0- Init START
myhost:101952:102062 [0] NCCL INFO MNNVL busId 0x1a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
myhost:101952:102062 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555,55555555,55555555
myhost:101952:102062 [0] NCCL INFO comm 0x611edb75d650 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
myhost:101952:102062 [0] NCCL INFO Channel 00/04 : 0 1
myhost:101952:102062 [0] NCCL INFO Channel 01/04 : 0 1
myhost:101952:102062 [0] NCCL INFO Channel 02/04 : 0 1
myhost:101952:102062 [0] NCCL INFO Channel 03/04 : 0 1
myhost:101952:102062 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
myhost:101952:102062 [0] NCCL INFO P2P Chunksize set to 131072
myhost:101952:102062 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
myhost:101952:102063 [0] NCCL INFO [Proxy Service] Device 0 CPU core 34
myhost:101952:102064 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 38
myhost:101952:102062 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
myhost:101952:102062 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myhost:101952:102062 [0] NCCL INFO CC Off, workFifoBytes 1048576
myhost:101952:102062 [0] NCCL INFO ncclCommSplit comm 0x611edb75d650 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 parent 0x611eda635410 splitCount 1 color 1684518034 key 0 - Init COMPLETE
myhost:101952:102062 [0] NCCL INFO Init timings - ncclCommSplit: rank 0 nranks 2 total 0.15 (kernels 0.00, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.15, graphs 0.00, connections 0.00, rest 0.00)
myhost:101952:102071 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 112
myhost:101952:102066 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:101952:102066 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:101952:102066 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:101952:102066 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:101952:102066 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:101952:102066 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:101952:102066 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:101952:102066 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:101952:102066 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
myhost:101952:101952 [0] NCCL INFO Comm config Blocking set to 0
myhost:101952:102076 [0] NCCL INFO Using network IBext_v8
myhost:101952:102076 [0] NCCL INFO ncclCommSplit comm 0x611edb7e7bc0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 parent 0x611eda635410 splitCount 2 color 2130503744 key 0- Init START
myhost:101952:102076 [0] NCCL INFO MNNVL busId 0x1a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
myhost:101952:102076 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555,55555555,55555555
myhost:101952:102076 [0] NCCL INFO comm 0x611edb7e7bc0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
myhost:101952:102076 [0] NCCL INFO Channel 00/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 01/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 02/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 03/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 04/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 05/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 06/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 07/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 08/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 09/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 10/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 11/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 12/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 13/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 14/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 15/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 16/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 17/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 18/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 19/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 20/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 21/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 22/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Channel 23/24 : 0 1
myhost:101952:102076 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
myhost:101952:102076 [0] NCCL INFO P2P Chunksize set to 524288
myhost:101952:102076 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 1 directMode 0
myhost:101952:102082 [0] NCCL INFO [Proxy Service] Device 0 CPU core 86
myhost:101952:102084 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 24
myhost:101952:102076 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
myhost:101952:102076 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
myhost:101952:102076 [0] NCCL INFO CC Off, workFifoBytes 1048576
myhost:101952:102076 [0] NCCL INFO ncclCommSplit comm 0x611edb7e7bc0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 parent 0x611eda635410 splitCount 2 color 2130503744 key 0 - Init COMPLETE
myhost:101952:102076 [0] NCCL INFO Init timings - ncclCommSplit: rank 0 nranks 2 total 0.19 (kernels 0.00, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.16, graphs 0.00, connections 0.01, rest 0.01)
myhost:101952:102088 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:102088 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
myhost:101952:103286 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 04/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 05/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 06/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 07/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 08/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 09/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 10/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 11/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 12/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 13/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 14/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 15/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 16/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 17/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 18/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 19/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 20/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 21/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 22/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 23/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 24/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 25/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 26/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 27/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 28/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 29/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 30/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:101952:103286 [0] NCCL INFO Channel 31/1 : 0[0] -> 1[1] via P2P/CUMEM

Note that in my cluster there's a plugin for SHARP, even though I'm leaving NCCL_COLLNET_ENABLE unset (hence disabled?).

When I set NCCL_COMM_BLOCKING=1, the segfault doesn't seem to occur.
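
For completeness, this is roughly how I apply that workaround (a sketch; in practice the variable is exported by my launcher before the process group is created):

import os

# Workaround that avoids the crash for me: force blocking communicator init.
# NCCL appears to read this at communicator-creation time, so it has to be in
# the environment before torch.distributed.init_process_group() is called
# (or be exported by the launcher).
os.environ["NCCL_COMM_BLOCKING"] = "1"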

@kiskra-nvidia
Member

Could you check whether running with the following environment variables set prevents the segfault?

NCCL_CUMEM_HOST_ENABLE=1
NCCL_COMM_SPLIT_SHARE_RESOURCES=0

@lw
Author

lw commented Feb 17, 2025

The segfault keeps happening even if I set those two environment variables. I can confirm that it's still caused by the value of *pp being equal to 0x1.

Here are the logs of this new run:

myhost:2261199:2261199 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
myhost:2261199:2261199 [0] NCCL INFO Bootstrap : Using eth0:10.4.167.14<0>
myhost:2261199:2261199 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
myhost:2261199:2261199 [0] NCCL INFO Comm config Blocking set to 0
myhost:2261199:2261199 [0] NCCL INFO NCCL_COMM_SPLIT_SHARE_RESOURCES set by environment to 0.
myhost:2261199:2261270 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
myhost:2261199:2261270 [0] NCCL INFO P2P plugin v8 IBext_v8
myhost:2261199:2261270 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
myhost:2261199:2261270 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
myhost:2261199:2261270 [0] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [4]ibp4:1/IB/SHARP [5]ibp5:1/IB/SHARP [6]ibp6:1/IB/SHARP [7]ibp7:1/IB/SHARP [RO]; OOB eth0:10.4.167.14<0>
myhost:2261199:2261270 [0] NCCL INFO Using non-device net plugin version 0
myhost:2261199:2261270 [0] NCCL INFO Using network IBext_v8
myhost:2261199:2261270 [0] NCCL INFO ncclCommInitRank comm 0x5951213c0f90 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1a000 commId 0xb1e56613d570a839 - Init START
myhost:2261199:2261270 [0] NCCL INFO MNNVL busId 0x1a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
myhost:2261199:2261270 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555,55555555,55555555
myhost:2261199:2261270 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
myhost:2261199:2261270 [0] NCCL INFO comm 0x5951213c0f90 rank 0 nRanks 4 nNodes 2 localRanks 2 localRank 0 MNNVL 0
myhost:2261199:2261270 [0] NCCL INFO Channel 00/08 :    0   1   2   3
myhost:2261199:2261270 [0] NCCL INFO Channel 01/08 :    0   3   2   1
myhost:2261199:2261270 [0] NCCL INFO Channel 02/08 :    0   1   2   3
myhost:2261199:2261270 [0] NCCL INFO Channel 03/08 :    0   3   2   1
myhost:2261199:2261270 [0] NCCL INFO Channel 04/08 :    0   1   2   3
myhost:2261199:2261270 [0] NCCL INFO Channel 05/08 :    0   3   2   1
myhost:2261199:2261270 [0] NCCL INFO Channel 06/08 :    0   1   2   3
myhost:2261199:2261270 [0] NCCL INFO Channel 07/08 :    0   3   2   1
myhost:2261199:2261270 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/2/-1->0->-1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->2 [5] -1/-1/-1->0->1 [6] 1/-1/-1->0->2 [7] -1/-1/-1->0->1
myhost:2261199:2261270 [0] NCCL INFO P2P Chunksize set to 131072
myhost:2261199:2261270 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 02/0 : 3[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 04/0 : 3[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 06/0 : 3[1] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261270 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261270 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261270 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261270 [0] NCCL INFO Channel 01/0 : 0[0] -> 3[1] [send] via NET/IBext_v8/1(1)/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 03/0 : 0[0] -> 3[1] [send] via NET/IBext_v8/1(1)/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 05/0 : 0[0] -> 3[1] [send] via NET/IBext_v8/1(1)/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 07/0 : 0[0] -> 3[1] [send] via NET/IBext_v8/1(1)/GDRDMA
myhost:2261199:2261287 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22.
myhost:2261199:2261270 [0] NCCL INFO Connected all rings
myhost:2261199:2261270 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261270 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261270 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261270 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261270 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 02/0 : 2[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 04/0 : 2[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 06/0 : 2[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 02/0 : 0[0] -> 2[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 04/0 : 0[0] -> 2[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Channel 06/0 : 0[0] -> 2[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261270 [0] NCCL INFO Connected all trees
myhost:2261199:2261270 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myhost:2261199:2261270 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
myhost:2261199:2261270 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
myhost:2261199:2261270 [0] NCCL INFO ncclCommInitRank comm 0x5951213c0f90 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1a000 commId 0xb1e56613d570a839 - Init COMPLETE
myhost:2261199:2261199 [0] NCCL INFO Comm config Blocking set to 0
myhost:2261199:2261301 [0] NCCL INFO Using non-device net plugin version 0
myhost:2261199:2261301 [0] NCCL INFO Using network IBext_v8
myhost:2261199:2261301 [0] NCCL INFO bootstrapSplit: comm 0x595122703ca0 parent 0x5951213c0f90 rank 0 nranks 2 color 1684518034 key 0 prev 2 next 2 - DONE
myhost:2261199:2261301 [0] NCCL INFO ncclCommSplit comm 0x595122703ca0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 parent 0x5951213c0f90 color 1684518034 key 0 commId 0x5a16c3838ce6f1f8 - Init START
myhost:2261199:2261301 [0] NCCL INFO MNNVL busId 0x1a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
myhost:2261199:2261301 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555,55555555,55555555
myhost:2261199:2261301 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 1.
myhost:2261199:2261301 [0] NCCL INFO comm 0x595122703ca0 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
myhost:2261199:2261301 [0] NCCL INFO Channel 00/04 :    0   1
myhost:2261199:2261301 [0] NCCL INFO Channel 01/04 :    0   1
myhost:2261199:2261301 [0] NCCL INFO Channel 02/04 :    0   1
myhost:2261199:2261301 [0] NCCL INFO Channel 03/04 :    0   1
myhost:2261199:2261301 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
myhost:2261199:2261301 [0] NCCL INFO P2P Chunksize set to 131072
myhost:2261199:2261301 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO Connected all rings
myhost:2261199:2261301 [0] NCCL INFO Connected all trees
myhost:2261199:2261301 [0] NCCL INFO CollNet 00/0 : 0 [receive] via COLLNET/SHARP/0/GDRDMA
myhost:2261199:2261302 [0] NCCL INFO SHARP rank 0/2 initialized on ibp0:1
myhost:2261199:2261301 [0] NCCL INFO CollNet 00/1 : 0 [send] via COLLNET/SHARP/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO CollNet 00/0 : 0 [receive] via COLLNET/SHARP/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO CollNet 00/1 : 0 [send] via COLLNET/SHARP/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO CollNet 00/0 : 0 [receive] via COLLNET/SHARP/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO CollNet 00/1 : 0 [send] via COLLNET/SHARP/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO CollNet 00/0 : 0 [receive] via COLLNET/SHARP/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO CollNet 00/1 : 0 [send] via COLLNET/SHARP/0/GDRDMA
myhost:2261199:2261301 [0] NCCL INFO Collnet Chains  [0] -1->0->2 [1] -1->0->2 [2] -1->0->2 [3] -1->0->2
myhost:2261199:2261301 [0] NCCL INFO Connected collnet + chain
myhost:2261199:2261301 [0] NCCL INFO rank 0 Connected CollNet
myhost:2261199:2261301 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
myhost:2261199:2261301 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myhost:2261199:2261301 [0] NCCL INFO ncclCommSplit comm 0x595122703ca0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 parent 0x5951213c0f90 color 1684518034 key 0 commId 0x5a16c3838ce6f1f8 - Init COMPLETE
myhost:2261199:2261199 [0] NCCL INFO Comm config Blocking set to 0
myhost:2261199:2261318 [0] NCCL INFO Using non-device net plugin version 0
myhost:2261199:2261318 [0] NCCL INFO Using network IBext_v8
myhost:2261199:2261318 [0] NCCL INFO bootstrapSplit: comm 0x59512271b310 parent 0x5951213c0f90 rank 0 nranks 2 color 2130503744 key 0 prev 1 next 1 - DONE
myhost:2261199:2261318 [0] NCCL INFO ncclCommSplit comm 0x59512271b310 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 parent 0x5951213c0f90 color 2130503744 key 0 commId 0x133beb5545aa6c92 - Init START
myhost:2261199:2261318 [0] NCCL INFO MNNVL busId 0x1a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
myhost:2261199:2261318 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,55555555,55555555,55555555
myhost:2261199:2261318 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0.
myhost:2261199:2261318 [0] NCCL INFO comm 0x59512271b310 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
myhost:2261199:2261318 [0] NCCL INFO Channel 00/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 01/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 02/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 03/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 04/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 05/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 06/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 07/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 08/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 09/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 10/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 11/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 12/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 13/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 14/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 15/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 16/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 17/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 18/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 19/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 20/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 21/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 22/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Channel 23/24 :    0   1
myhost:2261199:2261318 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
myhost:2261199:2261318 [0] NCCL INFO P2P Chunksize set to 524288
myhost:2261199:2261318 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2261318 [0] NCCL INFO Connected all rings
myhost:2261199:2261318 [0] NCCL INFO Connected all trees
myhost:2261199:2261318 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
myhost:2261199:2261318 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
myhost:2261199:2261318 [0] NCCL INFO ncclCommSplit comm 0x59512271b310 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1a000 parent 0x5951213c0f90 color 2130503744 key 0 commId 0x133beb5545aa6c92 - Init COMPLETE
myhost:2261199:2262973 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 04/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 05/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 06/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 07/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 08/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 09/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 10/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 11/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 12/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 13/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 14/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 15/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 16/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 17/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 18/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 19/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 20/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 21/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 22/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 23/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 24/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 25/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 26/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 27/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 28/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 29/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 30/1 : 0[0] -> 1[1] via P2P/CUMEM
myhost:2261199:2262973 [0] NCCL INFO Channel 31/1 : 0[0] -> 1[1] via P2P/CUMEM

@abcdabcd987

Same issue here! My code works fine on PyTorch 2.5 + CUDA 12.4. After bumping to PyTorch 2.6 + CUDA 12.6, the code breaks when running all_gather over a subgroup.

I looked at the core dump with gdb; indeed, *pp is 0x1.

Core was generated by `/opt/dlami/nvme/lequn/miniforge3-nvme/envs/rose/bin/python -c from multiprocess'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6fe5292b63 in ncclGroupCommJoin (comm=0x55dadecf4690) at include/group.h:113
113         while (*pp != nullptr && comm->intraComm0 != (*pp)->intraComm0)
[Current thread is 1 (Thread 0x7f7031f22740 (LWP 1518001))]

(gdb) bt
#0  0x00007f6fe5292b63 in ncclGroupCommJoin (comm=0x55dadecf4690) at include/group.h:113
#1  taskAppend (info=0x7ffe8aaac3e0, comm=0x55dadecf4690) at enqueue.cc:2152
#2  ncclEnqueueCheck (info=info@entry=0x7ffe8aaac3e0) at enqueue.cc:2224
#3  0x00007f6fe5285096 in ncclAllGather (sendbuff=0xa12400600, recvbuff=0xa12400800, sendcount=1, datatype=ncclFloat32, comm=0x55dadecf4690, stream=<optimized out>) at collectives.cc:88
#4  0x00007f700d06da42 in c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) () from /opt/dlami/nvme/lequn/miniforge3-nvme/envs/rose/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so
#5  0x00007f7027f40758 in c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) ()
   from /opt/dlami/nvme/lequn/miniforge3-nvme/envs/rose/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#6  0x00007f7027f5019c in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) ()
   from /opt/dlami/nvme/lequn/miniforge3-nvme/envs/rose/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#7  0x00007f7027619840 in torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) () from /opt/dlami/nvme/lequn/miniforge3-nvme/envs/rose/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f7027f68b62 in c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) () from /opt/dlami/nvme/lequn/miniforge3-nvme/envs/rose/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
...
(gdb) p pp
$3 = (ncclComm **) 0x55dadecf2cb0
(gdb) p *pp
$4 = (ncclComm *) 0x1

@abcdabcd987

Here's a short gist to reproduce the problem. Interestingly, the behavior differs between NCCL 2.21.5 (the official PyTorch wheel) and NCCL 2.25.1 (a custom-built PyTorch using the system NCCL):

https://gist.github.com/abcdabcd987/094b2b8c4015da420ae37655d04431f0

@kiskra-nvidia
Member

This looks like a bug we've encountered and already fixed for the next NCCL release. Could you try the following patch?

--- a/src/init.cc
+++ b/src/init.cc
@@ -315,11 +315,6 @@ ncclResult_t ncclCommEnsureReady(ncclComm_t comm) {
       if (ret == ncclInProgress) ret = ncclInvalidArgument;
       goto exit;
     }
-    /* if there is linked group job, we should complete it. */
-    if (comm->groupJob) {
-      NCCLCHECK(ncclGroupJobComplete(comm->groupJob));
-      comm->groupJob = NULL;
-    }
   }
 
 exit:
@@ -2215,6 +2210,11 @@ ncclResult_t ncclCommGetAsyncError(ncclComm_t comm, ncclResult_t *asyncError) {
 
   *asyncError = __atomic_load_n(&comm->asyncResult, __ATOMIC_ACQUIRE);
   if (*asyncError == ncclSuccess && comm->proxyState) *asyncError = __atomic_load_n(&comm->proxyState->asyncResult, __ATOMIC_ACQUIRE);
+  /* if there is linked group job, we should complete it. */
+  if (*asyncError == ncclSuccess && comm->groupJob) {
+    NCCLCHECK(ncclGroupJobComplete(comm->groupJob));
+    comm->groupJob = NULL;
+  }
   return ncclSuccess;
 }

@lw
Author

lw commented Feb 25, 2025

@kiskra-nvidia Thanks for your answer! I tried your patch, rebuilt PyTorch+NCCL, and indeed the segfault seems to have disappeared!

Do you have an ETA for the next NCCL release? Would it be possible to publish a micro release of 2.25? (That would be easier to pick up in PyTorch.) Thanks!

@kiskra-nvidia
Member

I'm glad to hear that the fix worked!

We have no plans to make another 2.25 release at this point. If it helps, we could commit the above patch to a branch in the GitHub repo.

The next NCCL version is in testing, but it's too early in the process to give an ETA.

@lw
Author

lw commented Feb 26, 2025

These days PyTorch doesn't build its own NCCL and instead depends on the official PyPI package, so we need an actual new release. We'll wait for the one that's already in testing, thanks!
