Use NCCL and toggle libuv for compatibility with latest pytorch versions #947
Description
This changes the distributed backend from Gloo to NCCL for macOS and Linux targets using CUDA, while keeping Gloo on Windows for any GPU, since NCCL is not compiled for Windows. It also conditionally disables libuv on Windows, because PyTorch > 2.4 is compiled without libuv there.
https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html#impact
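A minimal sketch of the selection logic described above (function names are illustrative, not the PR's actual code). The first helper picks the backend based on OS and CUDA availability; the second appends `use_libuv=0` to a TCP `init_method` URL on Windows, which is the documented way to opt out of the libuv TCPStore backend:

```python
def pick_backend(system: str, cuda_available: bool) -> str:
    """Choose the torch.distributed backend.

    Gloo on Windows (NCCL is not compiled for it) or when CUDA is
    unavailable; NCCL otherwise.
    """
    if system == "Windows" or not cuda_available:
        return "gloo"
    return "nccl"


def tcp_init_method(host: str, port: int, system: str) -> str:
    """Build an init_method URL, disabling libuv on Windows.

    PyTorch > 2.4 Windows wheels are built without libuv, so the
    query parameter use_libuv=0 falls back to the legacy TCPStore.
    """
    url = f"tcp://{host}:{port}"
    if system == "Windows":
        url += "?use_libuv=0"
    return url
```

In practice the results would be passed to `torch.distributed.init_process_group(backend=..., init_method=...)`, e.g. `init_process_group(pick_backend(platform.system(), torch.cuda.is_available()), init_method=tcp_init_method("127.0.0.1", 29500, platform.system()))`.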
This does not break ROCm compatibility, since ROCm builds of PyTorch use RCCL under the hood:
https://discuss.pytorch.org/t/ddp-with-amd-rocm/118928/2
Motivation and Context
This allows use of the latest PyTorch versions while keeping distributed training working on Windows, and switches to NCCL on other operating systems to improve multi-GPU CUDA performance on Linux runners.
How has this been tested?
Ran the code with an Nvidia GPU on Windows and Linux. The code now runs with PyTorch > 2.4 on Windows, where it previously failed with a libuv error.
Types of changes
Checklist: