Segfault when using ncclCommSplit with multiple threads and non-blocking init #1605
Could you try running with the following environment variables set and check whether that prevents the segfault?
The segfault keeps happening even if I set those two environment variables. I can confirm that it still crashes on the same value. Here are the logs of this new run:
Same issue here! My code works fine on PyTorch 2.5 + CUDA 12.4. After bumping to PyTorch 2.6 + CUDA 12.6, the code breaks when running all_gather over a subgroup. I looked at the core dump with gdb, and indeed it shows the same crash.
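For illustration, here is a minimal sketch of the kind of subgroup all_gather described in these reports. The launch details, tensor shapes, and the particular subgroup are assumptions for the example, not taken from the actual code:

```python
import torch
import torch.distributed as dist

# Assumes a torchrun-style launch (RANK/WORLD_SIZE/MASTER_ADDR set in the environment).
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Illustrative subgroup: the even ranks only.
even_ranks = list(range(0, dist.get_world_size(), 2))
subgroup = dist.new_group(ranks=even_ranks)

if rank in even_ranks:
    x = torch.full((4,), float(rank), device="cuda")
    out = [torch.empty_like(x) for _ in range(len(even_ranks))]
    # The reported crash happens in a subgroup all_gather like this one.
    dist.all_gather(out, x, group=subgroup)

dist.destroy_process_group()
```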
Here's a short gist to reproduce the problem. Interestingly, the behavior is different on NCCL 2.21.5 (official PyTorch wheel) versus NCCL 2.25.1 (custom-built PyTorch using the system NCCL): https://gist.github.com/abcdabcd987/094b2b8c4015da420ae37655d04431f0
This looks like a bug we've encountered and already fixed for the next NCCL release. Could you try the following patch?
@kiskra-nvidia Thanks for your answer! I tried your patch, rebuilt PyTorch+NCCL, and the segfault has indeed disappeared! Do you have an ETA for the next NCCL version? Would it be possible to publish a micro release for 2.25? (That would be easier to upgrade to in PyTorch.) Thanks!
I'm glad to hear that the fix worked! We have no plans to make another 2.25 release at this point. If it helps, we could commit the above patch on a branch in the GitHub repo. The next NCCL version is in testing, but it's too early in the process to give an ETA.
These days PyTorch doesn't build its own NCCL and instead just depends on the official PyPI package, hence we need an actual new version. We'll wait for the one that's already in testing, thanks!
I have 4 GPUs (2 nodes with 2 GPUs each). I initialize a global communicator, do a "barrier" (an all-reduce on 4 bytes followed by a CUDA device sync), split a cross-node "network rail" communicator (ranks 0<->2 and 1<->3), do another barrier, split an intra-node communicator (ranks 0<->1 and 2<->3), and do a final barrier. However, when I do an all-gather shortly afterwards, I hit a segfault in `ncclGroupCommJoin`.

I'm using NCCL via PyTorch. I initially encountered this issue on v2.21.5 (shipped with PyTorch), but I can also repro it on v2.25.1 (coming from `nvidia-nccl-cu12` on PyPI). This problem only seems to occur when I use PyTorch's new "bound device id" feature, which initializes NCCL eagerly and non-blockingly, and allows PyTorch to use `ncclCommSplit` to create subgroups.

I noticed that the segfault happens during the backward pass, which PyTorch executes on a separate thread for autograd. This likely means that all the all-gathers during the forward pass (in the main thread) were successful.
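For reference, a rough sketch of the sequence described above, using PyTorch's bound-device-id path. The rank math, tensor sizes, and launch details are illustrative assumptions, not taken from the actual code, and the all_gather here runs on the main thread rather than the autograd thread:

```python
import os
import torch
import torch.distributed as dist

def barrier():
    # "Barrier" as described above: a tiny all-reduce followed by a CUDA device sync.
    t = torch.zeros(1, dtype=torch.int32, device="cuda")
    dist.all_reduce(t)
    torch.cuda.synchronize()

# Assumes a torchrun-style launch with 4 ranks (2 nodes x 2 GPUs each).
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Passing device_id opts into the "bound device id" path: eager, non-blocking NCCL
# init, with subgroups created via ncclCommSplit instead of separate init calls.
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))
rank = dist.get_rank()
barrier()

# Cross-node "network rail" groups: ranks 0<->2 and 1<->3.
rail_groups = [dist.new_group([0, 2]), dist.new_group([1, 3])]
rail = rail_groups[rank % 2]
barrier()

# Intra-node groups: ranks 0<->1 and 2<->3.
node_groups = [dist.new_group([0, 1]), dist.new_group([2, 3])]
node = node_groups[rank // 2]
barrier()

# Shortly afterwards (in the report, during backward on the autograd thread),
# an all_gather over one of the subgroups segfaults in ncclGroupCommJoin.
x = torch.randn(8, device="cuda")
out = [torch.empty_like(x) for _ in range(2)]
dist.all_gather(out, x, group=node)
```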
Here is the information I could gather with GDB (let me know if I can get you more details; I can repro this reliably):
The segfault thus likely occurs when accessing `(*pp)->intraComm0`. Here is the output of NCCL_DEBUG=INFO on rank 0 (note that all ranks hit the segfault at the exact same time):
Note that in my cluster there's a plugin for SHARP, even though I'm leaving NCCL_COLLNET_ENABLE unset (so it should be disabled?).
When I set NCCL_COMM_BLOCKING=1 the segfault doesn't seem to occur.
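Until a release with the fix is available, one way to apply this workaround is to set the variable before the process group (and hence NCCL) is initialized. A minimal sketch, assuming the usual PyTorch setup (exporting it in the launch environment works just as well):

```python
import os
import torch.distributed as dist

# Workaround reported above: force blocking communicator initialization so the
# non-blocking init/split path is avoided. NCCL reads environment variables at
# communicator creation time, so this must run before init_process_group.
os.environ.setdefault("NCCL_COMM_BLOCKING", "1")

dist.init_process_group("nccl")  # rest of the setup unchanged
```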