NCCL backend fails during multi-node, multi-GPU training
raketenolli opened this issue · 0 comments
raketenolli commented
Bug description
I set up a training run on a Slurm cluster, specifying 2 nodes with 4 GPUs each. During initialization, I observed the behavior described in the PyTorch issue "Unexpected behavior (times out) of all_gather_into_tensor with subgroups".
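For context, a minimal sketch of the kind of launch configuration this corresponds to (the actual training script is not included in this report; the module and datamodule names below are placeholders):

```python
import pytorch_lightning as pl

# Placeholder imports; the real model/data code is not part of this report.
from my_project import MyLightningModule, MyDataModule


def main():
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,       # 4 GPUs per node
        num_nodes=2,     # 2 Slurm nodes
        strategy="ddp",  # torch.distributed with the NCCL backend on CUDA devices
    )
    trainer.fit(MyLightningModule(), datamodule=MyDataModule())


if __name__ == "__main__":
    main()
```

Under Slurm, Lightning detects the `SLURMEnvironment` automatically, so the script is started with `srun` and one task per GPU (8 tasks in total for this configuration).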
Apparently, this issue has not been solved at the PyTorch or NCCL level, but there is a workaround (described in this post on that same issue).
How and where could this workaround be implemented in PyTorch Lightning, if outright solving the underlying problem is not possible?
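Without reproducing the workaround from the linked post here, one place where such distributed-setup code could be injected today is a custom `DDPStrategy` subclass, assuming the workaround has to run right after the NCCL process group is created. A rough sketch, not an official Lightning answer, and assuming `setup_distributed` is still the hook that performs the process-group initialization in 2.4:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy


class PatchedDDPStrategy(DDPStrategy):
    """DDP strategy that runs extra setup right after the process group is initialized."""

    def setup_distributed(self) -> None:
        # Let Lightning perform its normal NCCL process-group initialization first.
        super().setup_distributed()
        # The actual workaround from the linked PyTorch issue would go here
        # (intentionally not reproduced, since the post is not quoted in this report).


trainer = Trainer(
    accelerator="gpu",
    devices=4,
    num_nodes=2,
    strategy=PatchedDDPStrategy(),
)
```

If the workaround only requires environment variables, they could instead be exported in the Slurm batch script before the Python process starts.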
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
No response
Environment
I'm working on a Slurm cluster with 2 head nodes (no GPUs), 6 compute nodes (configuration below), and NFS-mounted data storage.
<details>
<summary>Current environment</summary>
* CUDA:
- GPU:
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- available: True
- version: 12.1
* Lightning:
- lightning-utilities: 0.11.7
- pytorch-lightning: 2.4.0
- torch: 2.4.1+cu121
- torchmetrics: 1.4.2
- torchvision: 0.19.1+cu121
* Packages:
- absl-py: 2.1.0
- aiohappyeyeballs: 2.4.0
- aiohttp: 3.10.5
- aiosignal: 1.3.1
- albucore: 0.0.16
- albumentations: 1.4.15
- annotated-types: 0.7.0
- async-timeout: 4.0.3
- attrs: 24.2.0
- certifi: 2024.8.30
- charset-normalizer: 3.3.2
- contourpy: 1.3.0
- cycler: 0.12.1
- eval-type-backport: 0.2.0
- filelock: 3.13.1
- fonttools: 4.53.1
- frozenlist: 1.4.1
- fsspec: 2024.2.0
- future: 1.0.0
- geopandas: 1.0.1
- grpcio: 1.66.1
- huggingface-hub: 0.25.0
- idna: 3.10
- imageio: 2.35.1
- imgaug: 0.4.0
- jinja2: 3.1.3
- joblib: 1.4.2
- kiwisolver: 1.4.7
- lazy-loader: 0.4
- lightning-utilities: 0.11.7
- markdown: 3.7
- matplotlib: 3.9.2
- mpmath: 1.3.0
- msgpack: 1.1.0
- multidict: 6.1.0
- networkx: 3.2.1
- numpy: 1.26.3
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.20.5
- nvidia-nvjitlink-cu12: 12.1.105
- nvidia-nvtx-cu12: 12.1.105
- opencv-python: 4.10.0.84
- opencv-python-headless: 4.10.0.84
- packaging: 24.1
- pandas: 2.2.2
- pillow: 10.2.0
- pip: 22.3.1
- protobuf: 5.28.1
- pydantic: 2.9.2
- pydantic-core: 2.23.4
- pyogrio: 0.9.0
- pyparsing: 3.1.4
- pyproj: 3.6.1
- python-dateutil: 2.9.0.post0
- pytorch-lightning: 2.4.0
- pytz: 2024.2
- pyyaml: 6.0.2
- requests: 2.32.3
- s2sphere: 0.2.5
- safetensors: 0.4.5
- scikit-image: 0.24.0
- scikit-learn: 1.5.2
- scipy: 1.14.1
- setuptools: 65.5.0
- shapely: 2.0.6
- six: 1.16.0
- sympy: 1.12
- tensorboard: 2.17.1
- tensorboard-data-server: 0.7.2
- threadpoolctl: 3.5.0
- tifffile: 2024.8.30
- timm: 1.0.9
- torch: 2.4.1+cu121
- torchmetrics: 1.4.2
- torchvision: 0.19.1+cu121
- tqdm: 4.66.5
- triton: 3.0.0
- typing-extensions: 4.9.0
- tzdata: 2024.1
- urllib3: 2.2.3
- werkzeug: 3.0.4
- yarl: 1.11.1
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.9
- release: 5.15.0-50-generic
- version: #56~20.04.1-Ubuntu SMP Tue Sep 27 15:51:29 UTC 2022
</details>
More info
No response