[Bug] Is there a known bug with `Driver Version: 535.129.03` which causes `MscclppAllReduce3` to hang?
saeedmaleki opened this issue · 5 comments
Hi MSCCL++ team,
Do you know if `Driver Version: 535.129.03` has a bug that causes AllReduce3 to time out?
Thanks,
--Saeed
Hmm... we haven't tested with this version. The Azure HPC image uses driver 535.86.10 and doesn't have this issue.
https://github.com/Azure/azhpc-images/blob/63e5eaa23de69ccc1c6e6a52dff29037c88e96d4/ubuntu/common/install_nvidiagpudriver.sh#L16-L19
Thanks @Binyang2014! I'm debugging this issue with NVIDIA.
Hi @saeedmaleki, is this issue resolved on your end? 535.154.05 is working well in my environment.
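For anyone comparing environments, here is a minimal sketch for checking which driver a machine is running against the versions mentioned so far in this thread. The `REPORTED` table and the helper names are assumptions made for illustration, built only from the comments above. It is not an official compatibility list.

```python
import subprocess

# Driver versions as reported in this thread only (illustrative, not authoritative).
REPORTED = {
    "535.129.03": "hang reported",
    "535.86.10": "no hang reported (Azure HPC image)",
    "535.154.05": "reported working in one environment",
}


def parse_driver_versions(output: str) -> list[str]:
    """Return distinct driver versions from nvidia-smi CSV output (one line per GPU)."""
    seen: list[str] = []
    for line in output.splitlines():
        version = line.strip()
        if version and version not in seen:
            seen.append(version)
    return seen


def describe(version: str) -> str:
    """Look up what this thread says about a driver version."""
    return REPORTED.get(version, "not mentioned in this thread")


def query_driver_versions() -> list[str]:
    """Ask nvidia-smi for the installed driver version(s); requires nvidia-smi on PATH."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_driver_versions(result.stdout)
```

On a GPU node, something like `for v in query_driver_versions(): print(v, "->", describe(v))` would print the local driver version next to what has been reported here.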
It definitely still happens. I think this is a non-deterministic bug. NVIDIA couldn't reproduce it either, so maybe we can ignore it for now.
Actually, I can occasionally reproduce this bug. @Binyang2014 @aashaka please be aware.
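Since the hang only shows up occasionally, one way to estimate how often it occurs is to run the test repeatedly with a timeout and count the runs that never finish. The following is a rough sketch under stated assumptions: `count_hangs`, the run count, and the timeout are hypothetical choices, and the actual command that exercises `MscclppAllReduce3` is not given in this thread, so it has to be supplied by the caller.

```python
import subprocess
import sys


def count_hangs(cmd, runs=20, timeout_s=120.0):
    """Run cmd `runs` times and return the indices of runs that exceeded timeout_s."""
    hung = []
    for i in range(runs):
        try:
            # On timeout, subprocess.run kills the direct child and raises TimeoutExpired.
            subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            print(f"run {i} timed out after {timeout_s}s", file=sys.stderr)
            hung.append(i)
    return hung
```

Note that only the direct child process is killed on timeout. If the command is a launcher that spawns per-GPU workers, those workers may need to be cleaned up separately before the next run.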