NCCL backend fails during multi-node, multi-GPU training

Question

raketenolli opened this issue 3 months ago · 0 comments

Bug description

I set up a training run on a Slurm cluster with 2 nodes and 4 GPUs per node. During initialization, I ran into the behavior described in the PyTorch issue "Unexpected behavior (times out) of all_gather_into_tensor with subgroups".

Apparently, this issue has not been solved at the PyTorch or NCCL level, but there is a workaround (described in a post on that same issue).

How and where could this workaround be implemented in PyTorch Lightning if the underlying problem cannot be fixed outright?
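
One place such a workaround might be hooked in, as a sketch under assumptions: Lightning's `DDPStrategy` sets up the process group in its `setup_distributed()` method, so a subclass could run extra code right after that call. The helper `with_post_init_hook` and the `hook` callable are hypothetical names introduced for illustration, and the body of the workaround itself is deliberately left out because it depends on the linked post.

```python
from typing import Callable, Type


def with_post_init_hook(strategy_cls: Type, hook: Callable[[object], None]) -> Type:
    """Return a subclass of strategy_cls that calls hook(strategy) after setup_distributed()."""

    class PatchedStrategy(strategy_cls):
        def setup_distributed(self) -> None:
            # Let the original strategy create the default process group first.
            super().setup_distributed()
            # Then run the user-supplied workaround (hypothetical placeholder).
            hook(self)

    PatchedStrategy.__name__ = f"Patched{strategy_cls.__name__}"
    return PatchedStrategy


# Intended usage (assumes pytorch-lightning is installed; not executed here):
#   from pytorch_lightning import Trainer
#   from pytorch_lightning.strategies import DDPStrategy
#   Strategy = with_post_init_hook(DDPStrategy, my_workaround)
#   trainer = Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy=Strategy())
```

Wrapping the strategy class keeps the change outside the Lightning source tree, so it could be dropped again once the underlying issue is fixed upstream. Whether this is the right point in the setup sequence depends on what the workaround actually needs to touch.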

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please
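
No logs are attached here. As a rough sketch of how more detailed output could be collected on a rerun, the helper below sets the `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `TORCH_DISTRIBUTED_DEBUG` environment variables, which NCCL and PyTorch read at startup. The function name `enable_nccl_debug_logging` and the chosen subsystem list are assumptions made for illustration.

```python
import os
from typing import Dict, MutableMapping, Optional


def enable_nccl_debug_logging(
    env: Optional[MutableMapping[str, str]] = None,
) -> Dict[str, str]:
    """Set verbose logging variables without overriding values already present."""
    target = os.environ if env is None else env
    defaults = {
        "NCCL_DEBUG": "INFO",               # NCCL log level
        "NCCL_DEBUG_SUBSYS": "INIT,COLL",   # restrict output to init and collectives
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra torch.distributed checks and logs
    }
    for key, value in defaults.items():
        # setdefault keeps anything the job script already exported.
        target.setdefault(key, value)
    return {key: target[key] for key in defaults}
```

The call would need to happen in every process before the process group is initialized, for example at the top of the training script; exporting the same variables in the Slurm batch script should have the same effect.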

Environment

I'm working on a Slurm cluster with 2 head nodes (no GPUs), 6 compute nodes (configuration below), and NFS-mounted data storage.

<details>
  <summary>Current environment</summary>

* CUDA:
        - GPU:
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
                - NVIDIA RTX A6000
        - available:         True
        - version:           12.1
* Lightning:
        - lightning-utilities: 0.11.7
        - pytorch-lightning: 2.4.0
        - torch:             2.4.1+cu121
        - torchmetrics:      1.4.2
        - torchvision:       0.19.1+cu121
* Packages:
        - absl-py:           2.1.0
        - aiohappyeyeballs:  2.4.0
        - aiohttp:           3.10.5
        - aiosignal:         1.3.1
        - albucore:          0.0.16
        - albumentations:    1.4.15
        - annotated-types:   0.7.0
        - async-timeout:     4.0.3
        - attrs:             24.2.0
        - certifi:           2024.8.30
        - charset-normalizer: 3.3.2
        - contourpy:         1.3.0
        - cycler:            0.12.1
        - eval-type-backport: 0.2.0
        - filelock:          3.13.1
        - fonttools:         4.53.1
        - frozenlist:        1.4.1
        - fsspec:            2024.2.0
        - future:            1.0.0
        - geopandas:         1.0.1
        - grpcio:            1.66.1
        - huggingface-hub:   0.25.0
        - idna:              3.10
        - imageio:           2.35.1
        - imgaug:            0.4.0
        - jinja2:            3.1.3
        - joblib:            1.4.2
        - kiwisolver:        1.4.7
        - lazy-loader:       0.4
        - lightning-utilities: 0.11.7
        - markdown:          3.7
        - matplotlib:        3.9.2
        - mpmath:            1.3.0
        - msgpack:           1.1.0
        - multidict:         6.1.0
        - networkx:          3.2.1
        - numpy:             1.26.3
        - nvidia-cublas-cu12: 12.1.3.1
        - nvidia-cuda-cupti-cu12: 12.1.105
        - nvidia-cuda-nvrtc-cu12: 12.1.105
        - nvidia-cuda-runtime-cu12: 12.1.105
        - nvidia-cudnn-cu12: 9.1.0.70
        - nvidia-cufft-cu12: 11.0.2.54
        - nvidia-curand-cu12: 10.3.2.106
        - nvidia-cusolver-cu12: 11.4.5.107
        - nvidia-cusparse-cu12: 12.1.0.106
        - nvidia-nccl-cu12:  2.20.5
        - nvidia-nvjitlink-cu12: 12.1.105
        - nvidia-nvtx-cu12:  12.1.105
        - opencv-python:     4.10.0.84
        - opencv-python-headless: 4.10.0.84
        - packaging:         24.1
        - pandas:            2.2.2
        - pillow:            10.2.0
        - pip:               22.3.1
        - protobuf:          5.28.1
        - pydantic:          2.9.2
        - pydantic-core:     2.23.4
        - pyogrio:           0.9.0
        - pyparsing:         3.1.4
        - pyproj:            3.6.1
        - python-dateutil:   2.9.0.post0
        - pytorch-lightning: 2.4.0
        - pytz:              2024.2
        - pyyaml:            6.0.2
        - requests:          2.32.3
        - s2sphere:          0.2.5
        - safetensors:       0.4.5
        - scikit-image:      0.24.0
        - scikit-learn:      1.5.2
        - scipy:             1.14.1
        - setuptools:        65.5.0
        - shapely:           2.0.6
        - six:               1.16.0
        - sympy:             1.12
        - tensorboard:       2.17.1
        - tensorboard-data-server: 0.7.2
        - threadpoolctl:     3.5.0
        - tifffile:          2024.8.30
        - timm:              1.0.9
        - torch:             2.4.1+cu121
        - torchmetrics:      1.4.2
        - torchvision:       0.19.1+cu121
        - tqdm:              4.66.5
        - triton:            3.0.0
        - typing-extensions: 4.9.0
        - tzdata:            2024.1
        - urllib3:           2.2.3
        - werkzeug:          3.0.4
        - yarl:              1.11.1
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.10.9
        - release:           5.15.0-50-generic
        - version:           #56~20.04.1-Ubuntu SMP Tue Sep 27 15:51:29 UTC 2022

</details>

More info

No response