torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
luminousking commented
Hi, I encountered the issue below.
I'm running the minimal script on A100 80GB GPUs. What could be the potential reason for this error? Thanks!
Loading pipeline components...: 100%|█| 7/7 [00:01<00:00, 6.
Loading pipeline components...: 100%|█| 7/7 [00:01<00:00, 6.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
100%|████████████████████████| 50/50 [00:01<00:00, 40.67it/s]
[rank0]:[E ProcessGroupNCCL.cpp:1025] [PG 0 Rank 0] Future for ProcessGroup abort timed out after 600000 ms
[rank1]:[E ProcessGroupNCCL.cpp:1025] [PG 0 Rank 1] Future for ProcessGroup abort timed out after 600000 ms
[rank1]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0
[rank1]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 0
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 0
E0428 02:15:04.210000 140620742133568 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 2732148) of binary: /home/anaconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
File "/home/anaconda3/envs/distrifuser/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.3.0', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/anaconda3/envs/distrifuser/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
scripts/sd_example.py FAILED
--------------------------------------------------------
Failures:
[1]:
time : 2024-04-28_02:15:04
host : xxx
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 2732149)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2732149
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-28_02:15:04
host : xxx
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 2732148)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2732148
========================================================
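For reference, the NCCL watchdog message above names two environment variables that can be tried before launching. A minimal sketch, untested here; the torchrun arguments are only illustrative (the exact command was not posted, two ranks are inferred from the rank0/rank1 lines above):

# Knobs named in the NCCL watchdog message above; values and launch line are illustrative.
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800   # raise the default 600 s heartbeat timeout
# export TORCH_NCCL_ENABLE_MONITORING=0        # or disable the heartbeat monitor entirely
torchrun --nproc_per_node=2 scripts/sd_example.py   # two ranks, matching the logs above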
lmxyy commented
Could you provide the exact command you used to run it? Also, our code is tested on PyTorch 2.2; could you switch to that version?
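For example, something along these lines inside the distrifuser environment from the traceback should pull a 2.2 build; the cu121 wheel index is an assumption, so pick the index matching your CUDA setup:

# Downgrade PyTorch 2.3 -> 2.2 in the conda env shown in the traceback.
# cu121 is an assumption; use cu118 etc. to match your CUDA installation.
pip install "torch==2.2.2" --index-url https://download.pytorch.org/whl/cu121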
luminousking commented
Thanks! I solved this by downgrading PyTorch from 2.3 to 2.2.
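For anyone landing here later, a quick generic check (not specific to this repo) that the downgrade actually took effect in the active environment before re-running the script:

python -c "import torch; print(torch.__version__)"   # should report 2.2.x after the downgrade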