Driver Version: 535.129.03
MscclppAllReduce3
Hi MSCCL++ team,
Do you know if Driver Version: 535.129.03 has a bug that causes AllReduce3 to time out?
Thanks, --Saeed
Hmm... we haven't tested with this version. The Azure HPC image uses driver 535.86.10 and doesn't have this issue: https://github.com/Azure/azhpc-images/blob/63e5eaa23de69ccc1c6e6a52dff29037c88e96d4/ubuntu/common/install_nvidiagpudriver.sh#L16-L19
Thanks @Binyang2014! I'm debugging this issue with NVIDIA.
Hi @saeedmaleki, is this issue resolved on your end? 535.154.05 is working well in my environment.
It definitely still happens; I think this is a non-deterministic bug. NVIDIA couldn't reproduce it either, so maybe we can ignore it for now.
Actually, I can occasionally reproduce this bug. @Binyang2014 @aashaka please be aware.