microsoft/mscclpp

[Bug] Is there a known bug with `Driver Version: 535.129.03` which causes `MscclppAllReduce3` to hang?

saeedmaleki opened this issue · 5 comments

Hi MSCCL++ team,

Do you know if Driver Version: 535.129.03 has a bug that causes AllReduce3 to time out?

Thanks,
--Saeed

Hmm... not tested on this version. The Azure HPC image uses driver 535.86.10 and doesn't have this issue.
https://github.com/Azure/azhpc-images/blob/63e5eaa23de69ccc1c6e6a52dff29037c88e96d4/ubuntu/common/install_nvidiagpudriver.sh#L16-L19

Thanks @Binyang2014! Debugging this issue with NVIDIA.

Hi @saeedmaleki, is this issue resolved on your end? 535.154.05 is working well in my env.

It definitely still happens; I think this is a non-deterministic bug. NVIDIA couldn't reproduce it either, so maybe we could ignore it for now.

Actually, I can occasionally reproduce this bug. @Binyang2014 @aashaka please be aware.