NCCL initialization hangs with 4 GPUs, but works with 2 GPUs
mickaelseznec opened this issue · 4 comments
Hi,
When trying to run any NCCL application, it always hangs when running on more than 2 GPUs (see the attached logs, captured with NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=ALL).
The command is executed within Docker on an 8xH100 machine. We've successfully run simpleP2P, so all GPUs seem to be working. The issue seems to lie in NCCL (we're using 2.19.3).
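For reference, the kind of command we run looks roughly like this (a minimal sketch: the image name, nccl-tests binary path, and message sizes are placeholders, not our exact setup):

```sh
# Illustrative reproduction: nccl-tests' all_reduce_perf inside the container,
# with full NCCL tracing enabled via the environment.
docker run --rm --gpus all \
  -e NCCL_DEBUG=TRACE -e NCCL_DEBUG_SUBSYS=ALL \
  <image> ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 4
```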
Here is the log for 2 GPUs: all_reduce_2_gpus.txt. It completes successfully and I don't see anything concerning in the logs.
For 4 GPUs (all_reduce_4_gpus.txt), the program hangs indefinitely. The final log line for every GPU is something like:
c2ce1c877ea3:1023:1032 [0] NCCL INFO NVLS Bind mem 0xac0000000 UC handle 0x7f3310ccb240 MC handle 0x7f3310ccaa20 size 1073741824
We've tried increasing the shared memory size and memlock limit with --shm-size=1g --ulimit memlock=-1, as well as various env settings like NCCL_SHM_DISABLE=1 or NCCL_ALGO=Tree.
Do you have any idea where to look next?
Thanks a lot :)
Can you try with NCCL_NVLS_ENABLE=0?
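For example, passed into the container for a single run (reusing the illustrative command from above; the image and binary path are placeholders):

```sh
# Disable the NVLS (NVLink SHARP) algorithm for this run only.
docker run --rm --gpus all \
  -e NCCL_NVLS_ENABLE=0 \
  <image> ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 4
```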
Thanks a lot @sjeaugey, the example is working now!
Any insight into the probable cause of NVLS not working? Looking at the docs, it seems that NCCL doesn't use NVLS when it's not available (and I also thought that setting NCCL_ALGO=Tree would disable NVLS as well).
Ok, thanks for confirming. I'm actually not sure why the NVLS Bind calls would hang; it's outside of our scope, as those calls go to CUDA.
Actually, it could be because the fabric manager service isn't running. Note that if you restart it, you may need to reset all GPUs to make NVLS functional again. Rebooting is usually the easiest option.
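Something along these lines on the host should show whether it's running (the exact service name depends on how the driver/fabric manager package was installed):

```sh
# Check whether the NVIDIA fabric manager service is running on the host.
systemctl status nvidia-fabricmanager

# If it has to be (re)started, the GPUs may also need to be reset afterwards
# for NVLS to work again (rebooting the machine achieves the same thing).
sudo systemctl restart nvidia-fabricmanager
sudo nvidia-smi --gpu-reset
```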