huggingface/accelerate

Multi GPU process stuck

xwyzsn opened this issue · 6 comments

xwyzsn commented

System Info

- `Accelerate` version: 0.24.1
- Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.27
- Python version: 3.10.11
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.1.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 125.41 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: fp16
        - use_cpu: False
        - debug: True
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

My script is below; I removed some unnecessary code. If this snippet isn't enough to diagnose the issue, I can provide the entire code repository.

vali_loss_collect = []

class Trainer(object):

    def __init__(self):

        self.train_loader = ...   # dataloader creation elided
        self.vali_loader = ...    # dataloader creation elided
        self.test_loader = ...    # dataloader creation elided

        self.device = accelerator.device  # was: torch.device(f"cuda:{str(self.gpu)}" if torch.cuda.is_available() else "cpu")

        self.criterion = ...      # loss function elided
        self.model = MyModel()
        if torch.cuda.is_available():
            self.model = self.model.to(self.device)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=self.lr)
        self.model, self.optimizer, self.train_loader, self.vali_loader, self.test_loader = accelerator.prepare(
            self.model, self.optimizer, self.train_loader, self.vali_loader, self.test_loader
        )

    def vali(self, vali_loader):
        self.model.eval()
        for i, (input_data, _) in enumerate(vali_loader):
            out = self.model(input_data)
            loss = self.criterion(out)
            all_losses = accelerator.gather(loss)
            vali_loss_collect.extend([x.item() for x in all_losses])

        if accelerator.is_local_main_process:
            a = np.mean(vali_loss_collect)
            vali_loss_collect.clear()
            return a
        else:
            return np.mean([1e9])

    def train(self):
        early_stopping = EarlyStopping()
        for epoch in range(self.num_epochs):
            self.model.train()
            for i, (input_data, _) in enumerate(self.train_loader):
                output = self.model(input_data)
                loss = self.criterion(output)
                self.optimizer.zero_grad()
                accelerator.backward(loss)
                self.optimizer.step()
            vali_loss = self.vali(self.test_loader)
            early_stopping(vali_loss, self.model)
            if early_stopping.early_stop:
                print("stopped")
                break
            print(f"epoch:{epoch}")

            adjust_learning_rate(self.optimizer, epoch + 1, self.lr)

Expected behavior

The script ran normally during the first epoch, but when it reached the second epoch, some processes appear to be stuck.

# Since I'm training on 8 GPUs, there are 8 print outputs.
Epoch: 1, Epoch: 1,Epoch: 1


Epoch: 1, Epoch: 1,

Epoch: 1,  
Epoch: 1,  
Epoch: 1,  


# at the second epoch, only one output appeared at first
Epoch: 2, 
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800368 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800392 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800676 milliseconds before timing out.
Epoch: 2,   Epoch: 2, 
Epoch: 2,  

Epoch: 2,  
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800960 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800981 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800995 milliseconds before timing out.
Epoch: 2,  
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
Epoch: 2,   Epoch: 2, 

[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800368 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800368 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800995 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800995 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800981 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800981 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800676 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800676 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800960 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800960 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800392 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=48490, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800392 milliseconds before timing out.
[2023-11-21 10:57:17,327] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 46035 closing signal SIGTERM
[2023-11-21 10:57:17,441] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 46036) of binary: /usr/bin/python3.10/bin/python3.10
Traceback (most recent call last):
  File "/usr/bin/python3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/bin/python3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/bin/python3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/usr/bin/python3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/bin/python3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/bin/python3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/bin/python3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
main.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2023-11-21_10:57:17
  host      : d02371f49391
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 46037)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 46037
[2]:
  time      : 2023-11-21_10:57:17
  host      : d02371f49391
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 46038)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 46038
[3]:
  time      : 2023-11-21_10:57:17
  host      : d02371f49391
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 46039)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 46039
[4]:
  time      : 2023-11-21_10:57:17
  host      : d02371f49391
  rank      : 5 (local_rank: 5)
  exitcode  : -6 (pid: 46040)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 46040
[5]:
  time      : 2023-11-21_10:57:17
  host      : d02371f49391
  rank      : 6 (local_rank: 6)
  exitcode  : -6 (pid: 46041)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 46041
[6]:
  time      : 2023-11-21_10:57:17
  host      : d02371f49391
  rank      : 7 (local_rank: 7)
  exitcode  : -6 (pid: 46042)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 46042
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-21_10:57:17
  host      : d02371f49391
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 46036)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 46036
======================================================

muellerzr commented

Do you have the possibility to upgrade the Linux kernel? We have found that kernel versions < 5.5 can lead to processes hanging.

xwyzsn commented

Hi, I'm using a Docker container and don't have access to the host operating system. Also, I've noticed that everything appears normal when I train with 6 GPUs, but the issue above arises when I use all 8 GPUs.

muellerzr commented

A few things I can think of. Did you disable P2P? (It's not supported on those cards.) I experienced this when running on 2x 4090s, so it might be the case here as well: set NCCL_P2P_DISABLE=1. I also had to disable InfiniBand with NCCL_IB_DISABLE=1. Can you run it again and let me know if you still face the timeout/hang?
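For reference, a rough sketch of setting those flags from inside the script (just one way to do it, not a tested snippet from this thread; setting them on the command line before `accelerate launch` works the same way). The point is that NCCL only reads these variables when the process group is created, so they must be set before the Accelerator is constructed.

import os
from accelerate import Accelerator

# Set the NCCL flags before the process group is created, i.e. before the
# Accelerator is constructed. Equivalent to launching with
# NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch ...
os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # P2P is not supported on these cards
os.environ.setdefault("NCCL_IB_DISABLE", "1")   # disable the InfiniBand transport as well

accelerator = Accelerator()  # NCCL picks the variables up when it initializes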

Also: does it still fail with debug disabled?

xwyzsn commented

@muellerzr Thank you for your response. After trying your suggestion, it seems to be working normally.

xwyzsn commented

By the way, may I ask if this is the recommended/correct way to aggregate the final validation results across the different training processes?

# global accumulator
vali_loss_collect = []

# ...
    def vali(self, vali_loader):
        self.model.eval()
        for i, (input_data, _) in enumerate(vali_loader):
            out = self.model(input_data)
            loss = self.criterion(out)
            all_losses = accelerator.gather(loss)
            vali_loss_collect.extend([x.item() for x in all_losses])

        if accelerator.is_local_main_process:
            a = np.mean(vali_loss_collect)
            vali_loss_collect.clear()
            return a
        else:
            return np.mean([1e9])
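
For comparison, here is a rough sketch of an alternative (just a sketch, assuming `accelerator` and `torch` are in scope as above): gather the per-batch losses on every rank and have every rank return the same mean, so that all processes make identical early-stopping decisions.

    # inside the Trainer class
    def vali(self, vali_loader):
        self.model.eval()
        batch_losses = []
        with torch.no_grad():
            for input_data, _ in vali_loader:
                out = self.model(input_data)
                loss = self.criterion(out)
                # accelerator.gather returns one loss value per process for this batch
                batch_losses.append(accelerator.gather(loss))
        # identical on every rank, so every process sees the same early-stopping signal
        return torch.cat(batch_losses).mean().item()

Note that when the validation set isn't evenly divisible across processes, the padded duplicates in the last batch may make this mean slightly approximate.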