pytorch/builder

Different versions of `cuda` used for `nccl` and torch compilation

tangjiangling opened this issue · 1 comment

Problem Description

Since the precompiled torch wheels manage some of their dependencies on their own (e.g. CUDA, NCCL, cuDNN), installing torch via pip pulls those in as separate packages. Say I want to install torch 2.2.0:

$> pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121 

pip will go and download nvidia-nccl-cu12==2.19.3, as shown in the following log:

Collecting nvidia-nccl-cu12==2.19.3 (from torch==2.1.0.mt20240224+cu121)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (166.0 MB)
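
For reference, the same mismatch can be inspected from Python after installation. This is a minimal sketch; it only assumes the torch wheel installed above and a working CUDA setup:

# Minimal sketch: print the versions a pip-installed torch reports.
import torch

print(torch.__version__)          # e.g. 2.2.0+cu121
print(torch.version.cuda)         # CUDA toolkit torch was built against, e.g. 12.1
print(torch.cuda.nccl.version())  # bundled NCCL version, e.g. (2, 19, 3)

Note that this only shows which NCCL version torch loads, not which CUDA version that NCCL was built with; the nccl-tests check below covers that part.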

So here's the issue: the NCCL downloaded here is compiled with CUDA 12.3, while torch itself is built against CUDA 12.1.

Although the versions used at compile time are inconsistent, everything actually works (at least I haven't run into any problems so far), so I thought I'd ask here whether this inconsistency could be hiding problems I'm not aware of.

By the way, we can use nccl-tests to verify which CUDA version NCCL was compiled with:

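# The export below makes nccl-tests build and run against the libnccl shipped in the pip wheel
# (nvidia-nccl-cu12), rather than any system-wide NCCL install.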
$> export LD_LIBRARY_PATH=/usr/local/conda/lib/python3.9/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH
$> git clone \
    --recursive \
    --branch v2.13.6 \
    --single-branch \
    --depth 1 \
    https://github.com/NVIDIA/nccl-tests.git
$> cd nccl-tests
$> make -j16
$> NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

With NCCL_DEBUG=INFO set, nccl-tests prints the CUDA version NCCL was compiled with when it runs:

...
NCCL INFO cudaDriverVersion 12010
NCCL version 2.19.3+cuda12.3
...
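
The same version banner can also be triggered from PyTorch itself, without building nccl-tests, because NCCL prints it whenever a communicator is initialized with NCCL_DEBUG=INFO. A minimal single-process sketch (assumes one CUDA GPU and the wheels installed above; the address/port values are arbitrary):

import os
import torch
import torch.distributed as dist

# NCCL reads these when the communicator is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# A single-rank "world" is enough to make NCCL initialize and print
# "NCCL version <x.y.z>+cuda<a.b>" to stderr.
dist.init_process_group(backend="nccl", rank=0, world_size=1)
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # forces lazy NCCL communicator creation
dist.destroy_process_group()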

Now that containers are mainstream, it would be great to move off of Python packaging for NVIDIA artifacts and instead install them on the system (i.e. in the container, not in a conda environment, virtualenv, etc.).