llijiang/GuidedContrast

Timeout when using multi-GPU training

Closed this issue · 0 comments

Hi Jiang,

Following your instructions, I created the virtual environment, but I got the following error when training with four Tesla V100 GPUs:

[2023-05-17 10:22:10,548 INFO train.py line 223 37400] iter: 271/8000, lr: 2.3145e-03 loss: 1.5110(1.5110) data_time: 4.39(4.26) iter_time: 5.87(5.40) remain_time: 00:11:35:22
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=47362, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805481 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=47363, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=47362, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805538 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=47362, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805635 milliseconds before timing out.
Traceback (most recent call last):
File "train.py", line 339, in
train(cfg, model, model_fn, optimizer, scheduler, dataset, start_iter=start_iter) # start_iter from 0
File "train.py", line 252, in train
data_iterator_l, data_iterator_u, epoch_l, epoch_u, it_in_epoch_l, it_in_epoch_u = train_iter(
File "train.py", line 176, in train_iter
loss.backward()
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
torch.distributed.all_reduce(
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=47362, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805538 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=47362, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805635 milliseconds before timing out.
Traceback (most recent call last):
File "train.py", line 339, in
train(cfg, model, model_fn, optimizer, scheduler, dataset, start_iter=start_iter) # start_iter from 0
File "train.py", line 252, in train
data_iterator_l, data_iterator_u, epoch_l, epoch_u, it_in_epoch_l, it_in_epoch_u = train_iter(
File "train.py", line 176, in train_iter
loss.backward()
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
torch.distributed.all_reduce(
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=47362, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805481 milliseconds before timing out.
Traceback (most recent call last):
File "train.py", line 339, in
train(cfg, model, model_fn, optimizer, scheduler, dataset, start_iter=start_iter) # start_iter from 0
File "train.py", line 252, in train
data_iterator_l, data_iterator_u, epoch_l, epoch_u, it_in_epoch_l, it_in_epoch_u = train_iter(
File "train.py", line 176, in train_iter
loss.backward()
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
torch.distributed.all_reduce(
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=47363, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805489 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 37400) of binary: /data/home/scv3159/.conda/envs/py38/bin/python
Traceback (most recent call last):
File "/data/home/scv3159/.conda/envs/py38/bin/torchrun", line 8, in
sys.exit(main())
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/home/scv3159/.conda/envs/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2023-05-17_10:52:33
host : g0006.para.ai
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 37401)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 37401
[2]:
time : 2023-05-17_10:52:33
host : g0006.para.ai
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 37406)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-05-17_10:52:33
host : g0006.para.ai
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 37410)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-05-17_10:52:33
host : g0006.para.ai
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 37400)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
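
For reference, the 1800000 ms in the watchdog messages is the default 30-minute NCCL collective timeout, and the abort happens inside an all_reduce issued during loss.backward() (torch/nn/modules/_functions.py, which looks like SyncBatchNorm's backward), so one rank appears to stall around iteration 271 while the other three wait. In case it helps with debugging, here is a minimal sketch of how the collective timeout could be raised; the backend and init_method arguments are assumptions and may not match what train.py actually passes:

```python
import datetime
import torch.distributed as dist

# Sketch only, not the repo's actual initialization code.
# Raise the NCCL collective timeout from the default 30 minutes (1800000 ms)
# to 2 hours so a slow rank does not trip the watchdog while investigating.
dist.init_process_group(
    backend="nccl",
    init_method="env://",  # torchrun supplies MASTER_ADDR/PORT, RANK, WORLD_SIZE
    timeout=datetime.timedelta(hours=2),
)
```

Launching with `NCCL_DEBUG=INFO torchrun ...` also prints communicator setup details that can help narrow down which rank is stalling.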
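
Also, since the summary shows `error_file: <N/A>`, the per-rank tracebacks were not recorded. Following the linked elastic errors page, decorating the entry point with @record makes torchrun write them out; a sketch of what that could look like in train.py (the `main` name here is just illustrative of the existing entry point):

```python
from torch.distributed.elastic.multiprocessing.errors import record

@record  # torchrun then writes each failing rank's exception to its error file
def main():
    # existing entry point: parse cfg, build the model, then call
    # train(cfg, model, model_fn, optimizer, scheduler, dataset, start_iter=start_iter)
    ...

if __name__ == "__main__":
    main()
```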