NCCL initialization hangs with 4 GPUs, but works with 2 GPUs
mickaelseznec opened this issue · 4 comments
Hi,
When trying to run any NCCL application, it always hangs when running on more than 2 GPUs (see the attached logs, captured with NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=ALL).
The command is executed within Docker on an 8xH100 machine. We've successfully run simpleP2P, so all GPUs seem to be working. The issue seems to lie in NCCL (we're using 2.19.3).
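For reference, the kind of command we run looks roughly like this (a minimal sketch: the image name, nccl-tests binary path, and message sizes are placeholders, not our exact setup):

```sh
# Illustrative reproduction: nccl-tests' all_reduce_perf inside the container,
# with full NCCL tracing enabled via the environment.
docker run --rm --gpus all \
  -e NCCL_DEBUG=TRACE -e NCCL_DEBUG_SUBSYS=ALL \
  <image> ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 4
```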
Here is the log for 2 GPUs: all_reduce_2_gpus.txt. It completes successfully and I don't see anything concerning in the logs.
For 4 GPUs (all_reduce_4_gpus.txt), the program hangs indefinitely. The final log line for every GPU is something like:
c2ce1c877ea3:1023:1032 [0] NCCL INFO NVLS Bind mem 0xac0000000 UC handle 0x7f3310ccb240 MC handle 0x7f3310ccaa20 size 1073741824
We've tried increasing the shared memory size and memlock limit with --shm-size=1g --ulimit memlock=-1, as well as various env settings like NCCL_SHM_DISABLE=1 or NCCL_ALGO=Tree.
Do you have any idea where to look next?
Thanks a lot :)
Can you try with NCCL_NVLS_ENABLE=0?
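For example, passed into the container for a single run (reusing the illustrative command from above; the image and binary path are placeholders):

```sh
# Disable the NVLS (NVLink SHARP) algorithm for this run only.
docker run --rm --gpus all \
  -e NCCL_NVLS_ENABLE=0 \
  <image> ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 4
```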
Thanks a lot @sjeaugey, the example is working now!
Any insight into the probable cause of NVLS not working? Looking at the docs, it seems that NCCL doesn't use NVLS when it's not available (and I also thought that setting NCCL_ALGO=Tree would disable NVLS as well).
Ok, thanks for confirming. I'm actually not sure why the NVLS Bind calls would hang; it's outside of our scope, as those calls go to CUDA.
Actually, it could be because the fabric manager service isn't running. Note that if you restart it, you may need to reset all GPUs to make NVLS functional again. Rebooting is usually the easiest option.
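Something along these lines on the host should show whether it's running (the exact service name depends on how the driver/fabric manager package was installed):

```sh
# Check whether the NVIDIA fabric manager service is running on the host.
systemctl status nvidia-fabricmanager

# If it has to be (re)started, the GPUs may also need to be reset afterwards
# for NVLS to work again (rebooting the machine achieves the same thing).
sudo systemctl restart nvidia-fabricmanager
sudo nvidia-smi --gpu-reset
```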