Multi-GPU training code gets stuck after a few iterations
SSSHZ opened this issue · 3 comments
Hi, I tried the multi-GPU training code, but the program always got stuck after a few iterations.
Environment:
- PyTorch 1.7.1
- CUDA 10.2
- gcc version 7.5.0
- Ubuntu 18.04.3 LTS
Reproduce the bug:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 train_net_ddp.py --config_file coco --gpus 4
```
Output:
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/aa/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in main
process.wait()
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1653, in _wait
(pid, sts) = self._try_wait(0)
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1611, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt```
Hello, I think this bug is caused by allocating too many workers for the dataloader. When training with multiple GPUs, ${--bs} is the batch size of a single GPU, so the actual batch size of your command above is 24*4=96. For convenience, I directly set the number of workers equal to the batch size, and 96 workers for the dataloader is probably too many.
You can try setting a smaller batch size, such as --bs 6 when using 4 GPUs. Alternatively, you can modify num_worker of the function make_ddp_train_loader in dataset/data_loader.py.
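A minimal sketch of what that change could look like, assuming make_ddp_train_loader builds a standard torch.utils.data.DataLoader with a DistributedSampler (the dataset, batch_size, and max_workers names here are illustrative, not the repository's exact signature):

```python
# Hedged sketch, not the repository's actual make_ddp_train_loader:
# cap num_workers instead of tying it to the per-GPU batch size.
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

def make_ddp_train_loader(dataset: Dataset, batch_size: int, max_workers: int = 4) -> DataLoader:
    # DistributedSampler shards the dataset across the DDP processes.
    sampler = DistributedSampler(dataset, shuffle=True)
    # With batch_size=24 this spawns at most max_workers workers per process,
    # rather than 24 workers per process as in the original setup.
    num_workers = min(batch_size, max_workers)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=num_workers,
        pin_memory=True,
        drop_last=True,
    )
```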
After setting train.batch_size = 6 in configs/coco.py, I tried num_workers=4, 2, and 0 for make_ddp_train_loader in dataset/data_loader.py, and the same issue always happened. Perhaps this bug is not caused by the dataloader workers.
The problem might be caused by the combination of PyTorch 1.7.1, CUDA 10.2, and NCCL 2.7.8. The easiest solution for me was to switch to PyTorch 1.7.0.
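To confirm which stack is actually in use, a quick check like the following may help (these torch attributes exist in PyTorch 1.7.x; torch.cuda.nccl.version() returns an integer such as 2708 for NCCL 2.7.8):

```python
# Print the PyTorch / CUDA / NCCL versions bundled with the current install.
import torch

print("PyTorch:", torch.__version__)           # e.g. 1.7.1
print("CUDA:   ", torch.version.cuda)          # e.g. 10.2
print("NCCL:   ", torch.cuda.nccl.version())   # e.g. 2708 -> NCCL 2.7.8
```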