mit-han-lab/distrifuser

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


Hi, I encountered the issue below.
I'm running the minimal script on A100 80GB GPUs; what could be the potential reason for this error? Thanks!
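
For reference, the script was launched with torchrun. A rough reconstruction of the command, inferred from the log below (two ranks, scripts/sd_example.py), so the exact flags may have differed:

torchrun --nproc_per_node=2 scripts/sd_example.py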

Loading pipeline components...: 100%|█| 7/7 [00:01<00:00,  6.
Loading pipeline components...: 100%|█| 7/7 [00:01<00:00,  6.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
100%|████████████████████████| 50/50 [00:01<00:00, 40.67it/s]
[rank0]:[E ProcessGroupNCCL.cpp:1025] [PG 0 Rank 0] Future for ProcessGroup abort timed out after 600000 ms
[rank1]:[E ProcessGroupNCCL.cpp:1025] [PG 0 Rank 1] Future for ProcessGroup abort timed out after 600000 ms
[rank1]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0
[rank1]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 0
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 0
E0428 02:15:04.210000 140620742133568 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 2732148) of binary: /home/anaconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/distrifuser/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
scripts/sd_example.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-28_02:15:04
  host      : xxx
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 2732149)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2732149
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-28_02:15:04
  host      : xxx
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 2732148)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2732148
========================================================
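
As a side note, the watchdog message above suggests that if the abort is suspected to be a false positive, one can lengthen the heartbeat timeout via TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC or disable the monitor via TORCH_NCCL_ENABLE_MONITORING=0, for example (the two-process launch here is an assumption based on the ranks in the log):

TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800 torchrun --nproc_per_node=2 scripts/sd_example.py
TORCH_NCCL_ENABLE_MONITORING=0 torchrun --nproc_per_node=2 scripts/sd_example.py

Neither addresses the underlying hang itself, per the message, so this is only a mitigation while debugging.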

Could you provide the detailed running command you used? Also, our code is tested on PyTorch 2.2. Could you switch to that version?

Thanks! I solved this by downgrading PyTorch from 2.3 to 2.2.
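
For anyone hitting the same error, pinning the tested version looks roughly like this (assuming a pip-managed install inside the conda env; pick the build matching your CUDA toolkit):

pip install torch==2.2.2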