Hi NCCL maintainer,
I am developing a custom NCCL plugin that uses GDR with RDMA. The plugin works under the nccl-tests alltoall benchmark but hangs under allreduce. I am using this testing script:
My test is very simple: a 1KB allreduce between two GPUs. From the log I got below, I see that at the very beginning NCCL seems to issue two irecv calls targeting the same GPU memory address. It then keeps polling, but the second request can never complete because the peer issues only one isend. Why would NCCL issue seemingly duplicated irecv calls?
Even when I applied a quick hack that deduplicates the two irecv calls, so that polling on the second irecv returns immediately, NCCL still hangs, and no further irecv, isend, or itest calls are issued. Any thoughts on where I might be going wrong?
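To make the symptom concrete, here is a toy Python model of it. This is only a sketch and not NCCL or plugin code: `ToyEndpoint`, `deliver`, and the address value are hypothetical names made up for illustration, and the model assumes that each incoming send completes exactly one posted receive, in posting order.

```python
from collections import deque


class ToyEndpoint:
    """Hypothetical stand-in for the receive side of a plugin (not the NCCL API)."""

    def __init__(self):
        self.pending = deque()  # posted receives that have not been matched yet
        self.done = set()       # request ids that have completed
        self.next_id = 0

    def irecv(self, addr):
        """Post a receive into `addr` and return a request id."""
        req = self.next_id
        self.next_id += 1
        self.pending.append((req, addr))
        return req

    def deliver(self):
        """Model one isend arriving from the peer: complete the oldest receive."""
        if self.pending:
            req, _addr = self.pending.popleft()
            self.done.add(req)

    def test(self, req):
        """Poll a request; True once it has completed."""
        return req in self.done


ep = ToyEndpoint()
r0 = ep.irecv(0x7F00)  # first irecv (address is an arbitrary placeholder)
r1 = ep.irecv(0x7F00)  # second irecv to the same address
ep.deliver()           # the peer issues only one isend

status = [ep.test(r0), ep.test(r1)]
print(status)  # [True, False]: the second request stays pending forever
```

Under this assumption, polling the second request can never succeed, which matches the hang described above. The model says nothing about why the second irecv is posted in the first place.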
Best,
Yang