Multi GPU process stuck
xwyzsn opened this issue · 6 comments
System Info
- `Accelerate` version: 0.24.1
- Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.27
- Python version: 3.10.11
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.1.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 125.41 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: MULTI_GPU
  - mixed_precision: fp16
  - use_cpu: False
  - debug: True
  - num_processes: 8
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: all
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
My script is below; I removed some unnecessary code. If this code snippet doesn't help resolve the issue, I will provide the entire code repository.
vali_loss_collect = []

class Trainer(object):
    def __init__(self):
        self.train_loader = #
        self.vali_loader = #
        self.test_loader = #
        self.device = accelerator.device  # torch.device(f"cuda:{str(self.gpu)}" if torch.cuda.is_available() else "cpu")
        self.criterion = #
        self.model = MyModel()
        if torch.cuda.is_available():
            self.model = self.model.to(self.device)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=self.lr)
        self.model, self.optimizer, self.train_loader, self.vali_loader, self.test_loader = accelerator.prepare(
            self.model, self.optimizer, self.train_loader, self.vali_loader, self.test_loader
        )

    def vali(self, vali_loader):
        self.model.eval()
        for i, (input_data, _) in enumerate(vali_loader):
            out = self.model(input_data)
            loss = self.criterion(out)
            all_losses = accelerator.gather(loss)
            vali_loss_collect.extend([l.item() for l in all_losses])
        if accelerator.is_local_main_process:
            a = np.mean(vali_loss_collect)
            vali_loss_collect.clear()
            return a
        else:
            return np.mean([1e9])

    def train(self):
        early_stopping = EarlyStopping()
        for epoch in range(self.num_epochs):
            self.model.train()
            for i, (input_data, _) in enumerate(self.train_loader):
                output = self.model(input_data)
                loss = self.criterion(output)
                self.optimizer.zero_grad()
                accelerator.backward(loss)
                self.optimizer.step()
            vali_loss = self.vali(self.test_loader)
            early_stopping(vali_loss, self.model)
            if early_stopping.early_stop:
                print("stopped")
                break
            print(f"epoch:{epoch}")
            adjust_learning_rate(self.optimizer, epoch + 1, self.lr)
Expected behavior
The script ran normally during the first epoch, but when it reached the second epoch some of the processes appear to be stuck.
# Since I'm training on 8 GPUs, there are 8 print outputs.
Epoch: 1, Epoch: 1,Epoch: 1
Epoch: 1, Epoch: 1,
Epoch: 1,
Epoch: 1,
Epoch: 1,
# at the second epoch, only one output
Epoch: 2,
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800368 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800392 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800676 milliseconds before timing out.
Epoch: 2, Epoch: 2,
Epoch: 2,
Epoch: 2,
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800960 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800981 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800995 milliseconds before timing out.
Epoch: 2,
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
Epoch: 2, Epoch: 2,
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800368 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800368 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800995 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800995 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800981 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800981 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800676 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800676 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800960 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800960 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800392 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800392 milliseconds before timing out.
[2023-11-21 10:57:17,327] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 46035 closing signal SIGTERM
[2023-11-21 10:57:17,441] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 46036) of binary: /usr/bin/python3.10/bin/python3.10
Traceback (most recent call last):
File "/usr/bin/python3.10/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/bin/python3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/bin/python3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
multi_gpu_launcher(args)
File "/usr/bin/python3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/bin/python3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/bin/python3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/bin/python3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
main.py FAILED
------------------------------------------------------
Failures:
[1]:
time : 2023-11-21_10:57:17
host : d02371f49391
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 46037)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 46037
[2]:
time : 2023-11-21_10:57:17
host : d02371f49391
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 46038)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 46038
[3]:
time : 2023-11-21_10:57:17
host : d02371f49391
rank : 4 (local_rank: 4)
exitcode : -6 (pid: 46039)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 46039
[4]:
time : 2023-11-21_10:57:17
host : d02371f49391
rank : 5 (local_rank: 5)
exitcode : -6 (pid: 46040)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 46040
[5]:
time : 2023-11-21_10:57:17
host : d02371f49391
rank : 6 (local_rank: 6)
exitcode : -6 (pid: 46041)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 46041
[6]:
time : 2023-11-21_10:57:17
host : d02371f49391
rank : 7 (local_rank: 7)
exitcode : -6 (pid: 46042)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 46042
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-21_10:57:17
host : d02371f49391
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 46036)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 46036
======================================================
Do you have the option to upgrade the Linux kernel? We have found that kernel versions < 5.5 can lead to processes hanging.
Hi, I'm running inside a Docker container and don't have access to the host operating system. Also, I've noticed that everything works normally when I train with 6 GPUs, but the issue above arises when I use all 8 GPUs.
A few things I can think of. Did you disable P2P? (It's not supported on those cards.) I experienced this when running on two 4090s, so it might be the case here as well: `NCCL_P2P_DISABLE=1`. I also had to disable InfiniBand: `NCCL_IB_DISABLE=1`. Can you run it again and let me know if you still face the timeout/hang?
Also: does it still fail with debug disabled?
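For reference, a minimal sketch of one way to apply these flags. They can be prefixed onto the launch command (e.g. `NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch main.py`, with `main.py` standing in for the actual entry point), or set at the top of the script before the `Accelerator` is created:

# Sketch only: set the NCCL flags from inside the training script instead of
# on the command line. NCCL reads these when its communicators are created,
# so setting them before the Accelerator is constructed is the safe option.
import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # P2P reported as unsupported on these consumer GPUs (see comment above)
os.environ.setdefault("NCCL_IB_DISABLE", "1")   # disable the InfiniBand transport

from accelerate import Accelerator

accelerator = Accelerator()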
@muellerzr Thank you for your response. After trying your suggestion, it seems to be working normally.
By the way, may I ask whether this is the recommended/correct way to aggregate the final validation results across the different training processes?
# global
vali_loss_collect = []

# ....
def vali(self, vali_loader):
    self.model.eval()
    for i, (input_data, _) in enumerate(vali_loader):
        out = self.model(input_data)
        loss = self.criterion(out)
        all_losses = accelerator.gather(loss)
        vali_loss_collect.extend([l.item() for l in all_losses])
    if accelerator.is_local_main_process:
        a = np.mean(vali_loss_collect)
        vali_loss_collect.clear()
        return a
    else:
        return np.mean([1e9])
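For reference, a minimal sketch of an alternative pattern (assuming the same `accelerator`, `criterion`, and imports as above; not necessarily the officially recommended approach): average the per-batch losses locally, then reduce across ranks so that every process returns the same scalar.

def vali(self, vali_loader):
    # Sketch only: each rank computes its own mean validation loss, then the
    # means are averaged across processes with an all-reduce, so the returned
    # value is identical on every rank.
    self.model.eval()
    losses = []
    with torch.no_grad():
        for input_data, _ in vali_loader:
            out = self.model(input_data)
            losses.append(self.criterion(out))
    local_mean = torch.stack(losses).mean()
    global_mean = accelerator.reduce(local_mean, reduction="mean")
    return global_mean.item()

Whichever aggregation is used, the value fed into `early_stopping` should be the same on every rank, so that all processes take the same branch and keep issuing the same collectives.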