Different version of `cuda` used for `nccl` and torch compilation
tangjiangling opened this issue · 1 comment
Problem Description
Since the precompiled version of `torch` manages some of its dependencies on its own (e.g. `cuda`, `nccl`, `cudnn`), installing `torch` via `pip` pulls those in as well. Let's say I want to install version `2.2.0` of `torch`:
```
$> pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
```
It will go and download `nvidia-nccl-cu12==2.19.3`, as shown in the following log:
```
Collecting nvidia-nccl-cu12==2.19.3 (from torch==2.1.0.mt20240224+cu121)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (166.0 MB)
```
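A quick way to confirm which NCCL wheel actually landed in the environment, using nothing but standard `pip` introspection (a sketch, assuming the same environment as above):

```
# List every NVIDIA wheel pulled in alongside torch.
$> pip3 list | grep -i nvidia

# Show the resolved NCCL wheel and its version (2.19.3 in the log above).
$> pip3 show nvidia-nccl-cu12
```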
So here's the issue: the `nccl` downloaded here is compiled using `cuda12.3`, while `torch` uses `cuda12.1`.
Although the two are compiled against different CUDA versions, it actually works (at least I haven't hit any problems so far), so I thought I'd ask here whether this inconsistency could be hiding problems I'm not aware of.
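As a cross-check (a minimal sketch, assuming the same environment as above), `torch` itself can report the CUDA version it was built against and the NCCL version of the library it loads:

```
# CUDA version the torch wheel was built with (e.g. 12.1 for the cu121 wheel).
$> python3 -c "import torch; print(torch.version.cuda)"

# NCCL version as reported by the library torch links against.
$> python3 -c "import torch; print(torch.cuda.nccl.version())"
```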
By the way, we can use `nccl-tests` to verify the version of `cuda` used by the `nccl` compilation:
```
$> export LD_LIBRARY_PATH=/usr/local/conda/lib/python3.9/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH
$> git clone \
     --recursive \
     --branch v2.13.6 \
     --single-branch \
     --depth 1 \
     https://github.com/NVIDIA/nccl-tests.git
$> cd nccl-tests
$> make -j16
$> NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```
With `NCCL_DEBUG=INFO` set, `nccl-tests` prints the version of `cuda` that `nccl` was compiled with when it runs:
```
...
NCCL INFO cudaDriverVersion 12010
NCCL version 2.19.3+cuda12.3
...
```
Now that containers are mainstream, it would be great to move off of Python packaging for NVIDIA artifacts and instead install them on the system (i.e., in the container, not in a conda environment, virtualenv, etc.).
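For reference, a minimal sketch of that system-level approach inside a CUDA container. Everything here is illustrative rather than prescriptive: the base image tag, the apt package names (`libnccl2`, `libnccl-dev` from NVIDIA's repository, which the official `nvidia/cuda` devel images already have configured), and the `USE_SYSTEM_NCCL=1` source-build flag are the assumed ingredients:

```
# Inside e.g. nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 (illustrative tag), install
# NCCL from NVIDIA's apt repo; pin a +cuda12.1 package version if an exact match matters.
$> apt-get update && apt-get install -y libnccl2 libnccl-dev

# When building torch from source, link against the system NCCL instead of the
# bundled nvidia-nccl-cu12 wheel.
$> USE_SYSTEM_NCCL=1 python3 setup.py develop
```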