jctian98/e2e_lfmmi

How to set up a single machine with multiple GPUs?

Closed this issue · 0 comments

ranhl commented

Hello, I am trying to train locally with multiple GPUs and I am getting some errors. The local machine always has four GPUs, and I want to use two of them for training. I have set the DDP parameters as follows, but errors still occur. How should I set them?

export HOST_GPU_NUM=2
export HOST_NUM=1
export NODE_NUM=1
export INDEX=0

Training runs normally on a single GPU, but training with multiple GPUs does not run correctly.
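For context, this is the kind of single-node, two-GPU launch I have in mind. It is only a rough sketch of how I understand torch.distributed.launch is used; the script name train.py and the extra arguments are placeholders, not the actual entry point of this repo:

# Restrict training to two of the four local GPUs.
export CUDA_VISIBLE_DEVICES=0,1

# One node, two processes (one per visible GPU); train.py is a placeholder script name.
python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=127.0.0.1 \
    --master_port=29500 \
    train.py

# Newer PyTorch versions recommend the equivalent:
# torchrun --standalone --nproc_per_node=2 train.py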

error log:
2022-08-30 17:59:56,822 (ctc:138) INFO: CTC input lengths: tensor([140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140], device='cuda:0')
2022-08-30 17:59:56,823 (ctc:143) INFO: CTC output lengths: tensor([23, 18, 22, 23, 20, 22, 22, 22, 22, 21, 21, 24, 22, 21, 17, 23], device='cuda:0')
2022-08-30 17:59:56,823 (ctc:154) INFO: ctc loss:1071.3641357421875
2022-08-30 17:59:57,021 (e2e_asr_transducer:92) INFO: loss:1988.887451171875
2022-08-30 17:59:57,528 (asr:250) INFO: on device cuda:0 grad norm=7541.00927734375
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801883 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
2022-08-30 18:30:01,813 (ctc:138) INFO: CTC input lengths: tensor([123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123], device='cuda:1')
2022-08-30 18:30:01,813 (ctc:143) INFO: CTC output lengths: tensor([22, 21, 22, 23, 19, 21, 20, 20, 21, 26, 20, 21, 22, 24, 14, 24], device='cuda:1')
2022-08-30 18:30:01,814 (ctc:154) INFO: ctc loss:936.232421875
2022-08-30 18:30:02,099 (e2e_asr_transducer:92) INFO: loss:1777.2901611328125
2022-08-30 18:30:02,719 (asr:250) INFO: on device cuda:1 grad norm=6572.220703125
Tue Aug 30 18:30:02 2022 | rank: 1 | | iteration: 0 | gradient applied
2022-08-30 18:30:02,909 (ctc:138) INFO: CTC input lengths: tensor([34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34], device='cuda:1')
2022-08-30 18:30:02,910 (ctc:143) INFO: CTC output lengths: tensor([4, 9, 9, 6, 6, 9, 7, 6, 4, 4, 6, 8, 6, 9, 6, 5], device='cuda:1')
2022-08-30 18:30:02,910 (ctc:154) INFO: ctc loss:262.8006286621094
2022-08-30 18:30:02,972 (e2e_asr_transducer:92) INFO: loss:503.36956787109375
2022-08-30 18:30:03,115 (asr:250) INFO: on device cuda:1 grad norm=1751.3741455078125
/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 147095 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 147080) of binary: /home/miniconda3/envs/lfmmi/bin/python
Traceback (most recent call last):
File "/home/miniconda3/envs/lfmmi/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/miniconda3/envs/lfmmi/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/miniconda3/envs/lfmmi/lib/python3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
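For reference, these are generic NCCL / PyTorch debugging settings that, as far as I understand, can give more detail about the ALLREDUCE that hangs until the 30-minute watchdog timeout. They are general environment variables, not options specific to this repo:

# Print NCCL initialization and communication details in the training log.
export NCCL_DEBUG=INFO

# Ask PyTorch for more detailed collective/desync diagnostics (recent PyTorch versions).
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# Abort on NCCL errors or timeouts instead of hanging until the watchdog fires.
export NCCL_ASYNC_ERROR_HANDLING=1

# Disable peer-to-peer and InfiniBand transports, respectively;
# can be tried one at a time if a transport issue on this machine is suspected.
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1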