Azure/azhpc-images

NCCL graph and topology incompatible with A100


I'm using the ubuntu-hpc 2204 x64 Gen 2 image on a Standard NC24ads A100 v4 VM.

I'm training a model with vLLM, which uses NCCL, and I observe the following error:

Error

::16674:16674 [0] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
::16674:16674 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
::16674:16674 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
::16674:16674 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.6+cuda11.8
::16674:17023 [0] NCCL INFO NET/IB : Using [0]mlx5_an0:1/RoCE [RO]; OOB eth0:10.1.0.4<0>
::16674:17023 [0] NCCL INFO Using network IB
::16674:17023 [0] NCCL INFO comm 0x5640f24752d0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0x2064c00fc2f91516 - Init START
::16674:17023 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/microsoft/ncv4/topo.xml
::16674:17023 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
::16674:17023 [0] NCCL INFO NCCL_GRAPH_FILE set by environment to /opt/microsoft/ncv4/graph.xml

::16674:17023 [0] graph/search.cc:703 NCCL WARN XML Import Channel : dev 1 not found.
::16674:17023 [0] NCCL INFO graph/search.cc:733 -> 2
::16674:17023 [0] NCCL INFO graph/search.cc:740 -> 2
::16674:17023 [0] NCCL INFO graph/search.cc:840 -> 2
::16674:17023 [0] NCCL INFO init.cc:880 -> 2
::16674:17023 [0] NCCL INFO init.cc:1358 -> 2
::16674:17023 [0] NCCL INFO group.cc:65 -> 2 [Async thread]
::16674:16674 [0] NCCL INFO group.cc:406 -> 2
::16674:16674 [0] NCCL INFO group.cc:96 -> 2
Traceback (most recent call last):

...

self.llm_engine = LLMEngine.from_engine_args(engine_args)

File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 356, in from_engine_args
engine = cls(*engine_configs,
File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 111, in init
self._init_workers()
File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 151, in _init_workers
self._run_workers("init_model")
File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/worker/worker.py", line 84, in init_model
init_distributed_environment(self.parallel_config, self.rank,
File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/worker/worker.py", line 253, in init_distributed_environment
torch.distributed.all_reduce(torch.zeros(1).cuda())
File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
work = group.allreduce([tensor], opts)

torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1702400366987/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
XML Import Channel : dev 1 not found.
::16674:16674 [0] NCCL INFO comm 0x5640f24752d0 rank 0 nranks 1 cudaDev 0 busId 100000 - Abort COMPLETE
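For context, a minimal sketch (not from the original report; the model name is a placeholder) of the kind of call that reaches this failing path:

from vllm import LLM

# Building the engine runs a torch.distributed.all_reduce warm-up (see the traceback above),
# which initializes NCCL and imports the XML files pointed to by NCCL_TOPO_FILE / NCCL_GRAPH_FILE.
llm = LLM(model="facebook/opt-125m")  # placeholder model; tensor_parallel_size defaults to 1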

This is a single-GPU machine, but /opt/microsoft/ncv4/graph.xml and topology.xml reference 4 GPUs. If I update them to refer to a single GPU (as shown below), everything works.

graph.xml
<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="2" speedintra="12" speedinter="12" latencyinter="0" typeintra="SYS" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
  </graph>
  <graph id="1" pattern="1" crossnic="0" nchannels="4" speedintra="12" speedinter="12" latencyinter="0" typeintra="SYS" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" crossnic="0" nchannels="4" speedintra="12" speedinter="12" latencyinter="0" typeintra="SYS" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
  </graph>
</graphs>
topology.xml
<system version="1">
  <cpu numaid="0" affinity="00000000,00000000,00ffffff" arch="x86_64" vendor="AuthenticAMD" familyid="175" modelid="1">
    <pci busid="0001:00:00.0" class="0x030200" vendor="0x10de" device="0x20b5" subsystem_vendor="0x10de" subsystem_device="0x1533" link_speed="" link_width="0">
      <gpu dev="0" sm="80" rank="0" gdr="1">
        <nvlink target="0002:00:00.0" count="12" tclass="0x030200"/>
      </gpu>
    </pci>
    <nic>
      <net name="eth0" dev="0" speed="100000" port="0" latency="0.000000" guid="0x0" maxconn="65536" gdr="0"/>
    </nic>
  </cpu>
</system>
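For reference, a quick check (not part of the report) that compares the GPUs visible to CUDA against the <gpu> entries in the topo file from the NCCL log above; a mismatch is what triggers the "dev 1 not found" import failure:

import xml.etree.ElementTree as ET
import torch

topo_path = "/opt/microsoft/ncv4/topo.xml"  # path reported by NCCL_TOPO_FILE in the log
listed = len(ET.parse(topo_path).getroot().findall(".//gpu"))
visible = torch.cuda.device_count()
print(f"GPUs in topo.xml: {listed}, GPUs visible to CUDA: {visible}")
if listed != visible:
    print("Mismatch: NCCL will try to import channels for devices that do not exist.")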

Looking into it.

I ran into the same issue on Standard_NC48ads_A100_v4 with these images, and the same fix works.

Yes, the topo/graph files are not needed for the smaller NCv4 VM sizes. The next VM image release will stop loading them automatically (via /etc/nccl.conf) for these VM sizes. Until then, please delete the references to the topo/graph files and it should work.
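A small sketch of that cleanup, assuming the overrides live in /etc/nccl.conf as described above; it only locates the lines to remove and does not edit the file:

# Print the NCCL topo/graph overrides so they can be deleted or commented out
# (editing /etc/nccl.conf requires sudo and is left to the user).
conf = "/etc/nccl.conf"
with open(conf) as f:
    for lineno, line in enumerate(f, 1):
        if "NCCL_TOPO_FILE" in line or "NCCL_GRAPH_FILE" in line:
            print(f"{conf}:{lineno}: {line.rstrip()}")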