
How to trigger several independent communications (e.g., allgather) simultaneously? #1607

Open
Ind1x1 opened this issue Feb 16, 2025 · 5 comments

Comments

@Ind1x1

Ind1x1 commented Feb 16, 2025

For example, in training with 4 GPUs, I divide the GPUs into pairs and create two communication groups: group1 = dist.new_group([0, 1]) and group2 = dist.new_group([2, 3]). When I run independent dist.all_gather operations within both communication groups simultaneously, I get an error. I'd like to ask how to implement this correctly.

File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 209, in all_gather
    return torch.distributed.all_gather(tensor_list=tensor_list, tensor=tensor, group=group, async_op=async_op)
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2617, in all_gather
    work = group.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to 192.168.1.91<48217> failed : Software caused connection abort
node06:1913795:1914481 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
node06:1913796:1914482 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
node06:1913795:1914481 [2] NCCL INFO Channel 00/04 :    0   1
node06:1913795:1914481 [2] NCCL INFO Channel 01/04 :    0   1
node06:1913795:1914481 [2] NCCL INFO Channel 02/04 :    0   1
node06:1913795:1914481 [2] NCCL INFO Channel 03/04 :    0   1
node06:1913795:1914481 [2] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
node06:1913795:1914481 [2] NCCL INFO P2P Chunksize set to 131072
node06:1913796:1914482 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
node06:1913796:1914482 [3] NCCL INFO P2P Chunksize set to 131072
node06:1913795:1914481 [2] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/CUMEM
node06:1913796:1914482 [3] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/CUMEM
node06:1913795:1914481 [2] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/CUMEM
node06:1913796:1914482 [3] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/CUMEM
node06:1913795:1914481 [2] NCCL INFO Channel 02/0 : 0[2] -> 1[3] via P2P/CUMEM
node06:1913796:1914482 [3] NCCL INFO Channel 02/0 : 1[3] -> 0[2] via P2P/CUMEM
node06:1913795:1914481 [2] NCCL INFO Channel 03/0 : 0[2] -> 1[3] via P2P/CUMEM
node06:1913796:1914482 [3] NCCL INFO Channel 03/0 : 1[3] -> 0[2] via P2P/CUMEM
node06:1913796:1914482 [3] NCCL INFO Connected all rings
node06:1913796:1914482 [3] NCCL INFO Connected all trees
node06:1913796:1914482 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
node06:1913795:1914481 [2] NCCL INFO Connected all rings
node06:1913796:1914482 [3] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
node06:1913795:1914481 [2] NCCL INFO Connected all trees
node06:1913795:1914481 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
node06:1913795:1914481 [2] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
node06:1913795:1914481 [2] NCCL INFO comm 0x1a9590b0 rank 0 nranks 2 cudaDev 2 nvmlDev 2 busId 6c000 commId 0xdd736563a6f28c07 - Init COMPLETE
node06:1913796:1914482 [3] NCCL INFO comm 0x1931a220 rank 1 nranks 2 cudaDev 3 nvmlDev 3 busId 6d000 commId 0xdd736563a6f28c07 - Init COMPLETE
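
For reference, here is a minimal standalone sketch of the pattern I am trying to achieve (hypothetical script and tensor shapes, launched with torchrun --nproc_per_node=4; not my actual training code):

import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Every rank calls new_group() for every group, in the same order,
    # even for groups it does not belong to.
    group01 = dist.new_group([0, 1])
    group23 = dist.new_group([2, 3])
    my_group = group01 if rank in (0, 1) else group23

    grad = torch.full((4,), float(rank), device="cuda")
    gathered = [torch.empty_like(grad) for _ in range(2)]

    # The two 2-rank all_gathers should run independently and concurrently.
    dist.all_gather(gathered, grad, group=my_group)
    print(f"rank {rank}: {[t.tolist() for t in gathered]}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
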
@kiskra-nvidia
Copy link
Member

The particular error you're seeing (socketStartConnect: Connect to 192.168.1.91<48217> failed : Software caused connection abort) should be fixed in the current NCCL version. That said, it's not the sort of issue that one would expect to be triggered randomly.

Creating non-overlapping communicators and running simultaneous collective operations on them is well supported in NCCL. We would need more context to understand what triggers the problem in your case. The small excerpt of the NCCL_DEBUG output you posted is not sufficient to explain it -- we would like to see the complete output, from start to finish.
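
For example, the full per-process log can be captured with something like the following (set before the first NCCL communicator is created; the file path is just an example):

import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl_debug.%h.%p.log"  # %h = hostname, %p = pid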

@Ind1x1
Author

Ind1x1 commented Feb 25, 2025

@kiskra-nvidia Thank you very much for your reply.
For example, in a training task with num_rank = 4, I inserted an additional communication after the model update, and I tried to perform an extra gradient synchronization. This exchange operation gather the gradients between GPU rank == 0 and GPU rank == 1, and between GPU rank == 2 and GPU rank == 3. To do this, I created a communication group for rank == 0/1 and another for rank == 2/3. When I try to gather using dist.all_gather with the NCCL backend, the dist.all_gather operations within the two communication groups should occur simultaneously, but an error output appeared. What I want to know is whether it is feasible for the dist.all_gather operations within these two communication groups to happen concurrently, and if it is, what should I add to ensure the program runs smoothly?

Error log:

node06:2260622:2261282 [1] NCCL INFO Using non-device net plugin version 0
node06:2260622:2261282 [1] NCCL INFO Using network Socket

node06:2260622:2261282 [1] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 192.168.0.91<2871> failed : Software caused connection abort
node06:2260622:2261282 [1] NCCL INFO misc/socket.cc:565 -> 2
node06:2260622:2261282 [1] NCCL INFO misc/socket.cc:619 -> 2
node06:2260622:2261282 [1] NCCL INFO bootstrap.cc:274 -> 2
node06:2260622:2261282 [1] NCCL INFO init.cc:1388 -> 2
node06:2260622:2261282 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
node06:2260622:2260622 [1] NCCL INFO group.cc:418 -> 2
node06:2260622:2260622 [1] NCCL INFO group.cc:95 -> 2
NCCL version 2.19.3+cuda12.3
node06:2260623:2261284 [2] NCCL INFO Using non-device net plugin version 0
node06:2260623:2261284 [2] NCCL INFO Using network Socket
Traceback (most recent call last):
  File "/home/yeleyi/ttst/test.py", line 154, in <module>
    main()
  File "/home/yeleyi/ttst/test.py", line 138, in main
    model_engine.step()
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2244, in step
    self._take_model_step(lr_kwargs)
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2147, in _take_model_step
    self.optimizer.step()
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2265, in step
    self.zoetic_all_gather_grad()
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2312, in zoetic_all_gather_grad
    dist.all_gather(self.zoetic_buffer[index],tensor[offset:offset+block_size],group=self.vertin_checkpoint_group)
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 236, in all_gather
    return cdb.all_gather(tensor_list=tensor_list, tensor=tensor, group=group, async_op=async_op)
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 209, in all_gather
    return torch.distributed.all_gather(tensor_list=tensor_list, tensor=tensor, group=group, async_op=async_op)
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/yeleyi/anaconda3/envs/torch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2617, in all_gather
    work = group.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to 192.168.0.91<2871> failed : Software caused connection abort
node06:2260624:2261281 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
node06:2260621:2261280 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ff000000,0fffffff
node06:2260624:2261281 [3] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
node06:2260624:2261281 [3] NCCL INFO P2P Chunksize set to 131072
node06:2260621:2261280 [0] NCCL INFO Channel 00/04 :    0   1
node06:2260621:2261280 [0] NCCL INFO Channel 01/04 :    0   1
node06:2260621:2261280 [0] NCCL INFO Channel 02/04 :    0   1
node06:2260621:2261280 [0] NCCL INFO Channel 03/04 :    0   1
node06:2260621:2261280 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
node06:2260621:2261280 [0] NCCL INFO P2P Chunksize set to 131072
node06:2260624:2261281 [3] NCCL INFO Channel 00/0 : 1[3] -> 0[0] via P2P/CUMEM
node06:2260624:2261281 [3] NCCL INFO Channel 01/0 : 1[3] -> 0[0] via P2P/CUMEM
node06:2260624:2261281 [3] NCCL INFO Channel 02/0 : 1[3] -> 0[0] via P2P/CUMEM
node06:2260621:2261280 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[3] via P2P/CUMEM
node06:2260624:2261281 [3] NCCL INFO Channel 03/0 : 1[3] -> 0[0] via P2P/CUMEM
node06:2260621:2261280 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[3] via P2P/CUMEM
node06:2260621:2261280 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[3] via P2P/CUMEM
node06:2260621:2261280 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[3] via P2P/CUMEM
node06:2260621:2261280 [0] NCCL INFO Connected all rings
node06:2260621:2261280 [0] NCCL INFO Connected all trees
node06:2260624:2261281 [3] NCCL INFO Connected all rings
node06:2260624:2261281 [3] NCCL INFO Connected all trees
node06:2260624:2261281 [3] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
node06:2260624:2261281 [3] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
node06:2260621:2261280 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
node06:2260621:2261280 [0] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
node06:2260621:2261280 [0] NCCL INFO comm 0x25003c40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 67000 commId 0x1ab8d91d56d4e568 - Init COMPLETE
node06:2260624:2261281 [3] NCCL INFO comm 0x24f0a960 rank 1 nranks 2 cudaDev 3 nvmlDev 3 busId 6d000 commId 0x1ab8d91d56d4e568 - Init COMPLETE

Implementation:

....
self.free_grad_in_param_list(self.params_in_partition[i])
self.averaged_gradients[i] = None
self.unscale_and_clip_grads([single_grad_partition], scaled_global_grad_norm)
self.timers(OPTIMIZER_GRADIENTS_TIMER).stop()

# Step 3: run the optimizer if no offloading
self.timers(OPTIMIZER_STEP_TIMER).start()
self._optimizer_step(i)

# Step 4: get rid of the fp32 gradients. Not needed anymore
self.single_partition_of_fp32_groups[i].grad = None
# del single_grad_partition

# The above implementation achieves the model update for DP=4, i.e., data-parallel training with 4 GPUs.
self.all_gather_grad(single_grad_partition)
# GPU0 <--single_grad_partition--> GPU1 ; GPU2 <--single_grad_partition--> GPU3

Comm function:

def all_gather_grad(self, grad):
    # Pre-allocate one receive buffer per rank in the 2-rank group, then gather.
    tensor_list = [grad.clone() for _ in range(2)]
    dist.all_gather(tensor_list, grad, group=self.gather_checkpoint_group)
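
The gather can also be issued asynchronously; a sketch of the same call with async_op=True (assumed method name, same group attribute) would look like this:

import torch
import torch.distributed as dist

def all_gather_grad_async(self, grad):
    # Launch the gather without blocking, then wait on this rank's handle.
    tensor_list = [torch.empty_like(grad) for _ in range(2)]
    handle = dist.all_gather(tensor_list, grad,
                             group=self.gather_checkpoint_group, async_op=True)
    handle.wait()
    return tensor_list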

@kiskra-nvidia
Member

It's still not enough NCCL output to tell for sure, but my guess is that the communicators are not being initialized properly. Most likely a stale ncclUniqueId is being used (i.e., the value is not being updated/propagated correctly for different communicators).

Whether that's due to a problem in your code or in PyTorch, I don't know. My recommendation would be to ask the PyTorch community for help...
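
For example, one common way to end up with mismatched communicator state is to create the groups non-collectively: dist.new_group() must be called by every rank, in the same order, even for groups a rank does not belong to. A hypothetical anti-pattern:

import torch
import torch.distributed as dist

rank = dist.get_rank()
# Anti-pattern: each rank only calls new_group() for its own pair, so the
# group bootstrap is not executed collectively by all ranks.
if rank in (0, 1):
    group = dist.new_group([0, 1])   # only ranks 0/1 execute this call
else:
    group = dist.new_group([2, 3])   # only ranks 2/3 execute this call

tensor = torch.ones(4, device="cuda")
out = [torch.empty_like(tensor) for _ in range(2)]
dist.all_gather(out, tensor, group=group)  # can hang or fail with NCCL errors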

@Ind1x1
Author

Ind1x1 commented Feb 26, 2025

@kiskra-nvidia Thank you for your response. Does this mean that the issue is likely due to a bug in the code (an incorrect NCCL call) rather than the communication backend not allowing two independent communications? I will try to submit the issue to the PyTorch community.

@kiskra-nvidia
Member

Yes, my suspicion is that the bug is in a software layer above NCCL -- whether in PyTorch or in the app, I can't tell.
