allreduce issues duplicated irecv under network plugin #1626

Open
YangZhou1997 opened this issue Mar 3, 2025 · 0 comments

Hi NCCL maintainer,

I am developing a custom NCCL network plugin that uses GDR with RDMA. The plugin works well with the nccl-tests alltoall benchmark, but hangs with allreduce. This is the test script I am using:

    mpirun --bind-to none -np 2 -N 1 --host 10.0.0.1,10.0.0.2 \
        --mca plm_rsh_args "-o StrictHostKeyChecking=no" \
        --mca orte_base_help_aggregate 0 \
        -x LD_PRELOAD="${LIBNCCL_PATH} ${PLUGIN_PATH}" \
        -x NCCL_DEBUG=INFO \
        -x NCCL_P2P_DISABLE=1 \
        -x NCCL_SHM_DISABLE=1 \
        -x NCCL_NET_DISABLE=0 \
        -x NCCL_MAX_NCHANNELS=1 \
        -x NCCL_MIN_NCHANNELS=1 \
        -x NCCL_P2P_NET_CHUNKSIZE=524288 \
        -x NCCL_BUFFSIZE=8388608 \
        all_reduce_perf \
        -b 1K -e 1K -f 2 -g 1 -w 0 -n 1 -t 1

My test is very simple: a 1 KB allreduce between two GPUs. From the log below, it looks like NCCL issues two irecv calls targeting the same GPU memory address at the very beginning. It then keeps polling, but the polling never succeeds because the peer issues only one isend. Why would NCCL issue what appear to be duplicate irecv calls?

Image

Even with a quick hack that deduplicates the two irecv calls, so that polling on the second irecv returns immediately, NCCL still hangs, with no further irecv, isend, or itest calls issued. Any thoughts on where I might have gone wrong?
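To make the symptom concrete, here is a minimal sketch of the kind of bookkeeping that could be used to log repeated irecv postings. It is a standalone Python model, not the actual plugin code and not part of the NCCL API: `RecvTracker`, `on_irecv`, `on_complete`, and the `(comm_id, addr, tag)` key are hypothetical names, and it assumes the plugin can see the destination address and tag of each posted receive.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RecvTracker:
    """Counts outstanding receive postings per (comm_id, addr, tag) key.

    Hypothetical helper for illustration only; none of these names
    come from NCCL.
    """
    outstanding: dict = field(default_factory=lambda: defaultdict(int))

    def on_irecv(self, comm_id: int, addr: int, tag: int) -> bool:
        """Record a posted receive.

        Returns True if a receive with the same key is already pending.
        """
        key = (comm_id, addr, tag)
        duplicate = self.outstanding[key] > 0
        self.outstanding[key] += 1
        return duplicate

    def on_complete(self, comm_id: int, addr: int, tag: int) -> None:
        """Record that one matching send arrived and completed a receive."""
        key = (comm_id, addr, tag)
        if self.outstanding[key] == 0:
            raise ValueError(f"completion without a posted receive: {key}")
        self.outstanding[key] -= 1

    def pending(self) -> int:
        """Total number of posted receives still waiting for data."""
        return sum(self.outstanding.values())


if __name__ == "__main__":
    tracker = RecvTracker()
    addr = 0x7F0000000000  # made-up GPU address for the example

    first = tracker.on_irecv(comm_id=0, addr=addr, tag=0)
    second = tracker.on_irecv(comm_id=0, addr=addr, tag=0)
    tracker.on_complete(comm_id=0, addr=addr, tag=0)  # only one isend arrives

    print("first posting flagged as duplicate:", first)
    print("second posting flagged as duplicate:", second)
    print("receives still pending:", tracker.pending())
```

The sketch counts repeated postings instead of dropping them, so both receives stay visible in the log and it is easy to check whether a second isend ever arrives for the same address.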

Best,
Yang
