Fails to train in multi-GPU mode
didadida-r opened this issue · 1 comments
didadida-r commented
Hi,
I can run the codec model on a single GPU, but I cannot train it in multi-GPU mode. The log is below.
The environment is:
CUDA Version: 12.2
alias-free-torch 0.0.6
pytorch-wpe 0.0.1
torch 1.13.1
torch-complex 0.4.3
torchaudio 0.13.1
torchvision 0.14.1
The main log is:
./run_freqcodec.sh: gpu_num: 2
stage 3: Training
log can be found at ./exp/freqcodec_mag_phase_16k_n32_600k_step_ds640/log/train.log.0
The detailed log is:
-rw-rw-r-- 1 test test 0 Jan 18 11:07 train.log.0
-rw-rw-r-- 1 test test 884 Jan 18 11:07 train.log.1
cat train.log.1
Traceback (most recent call last):
File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/test/code/enhance/FunCodec/funcodec/bin/codec_train.py", line 32, in <module>
torch.cuda.set_device(args.gpu_id)
File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/site-packages/torch/cuda/__init__.py", line 326, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ZhihaoDU commented
Thanks for your report. This issue is caused by differing torch DDP behavior across versions. You can try commenting out line 32 of funcodec/bin/codec_train.py as follows:
# torch.cuda.set_device(args.gpu_id)
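For context, "invalid device ordinal" usually means the requested index is not smaller than the number of devices the process can see: when a DDP launcher isolates each worker with a per-process CUDA_VISIBLE_DEVICES, the worker sees its assigned GPU as device 0, so passing the global id to torch.cuda.set_device() fails. If you would rather keep the call than comment it out, here is a minimal stdlib-only sketch of the id remapping (remap_gpu_id is a hypothetical helper for illustration, not part of FunCodec):

```python
import os

def remap_gpu_id(gpu_id: int) -> int:
    """Map a global GPU id to the ordinal visible inside this process.

    Hypothetical helper: if CUDA_VISIBLE_DEVICES restricts this process
    to a subset of GPUs, the global id must be translated to its position
    in that subset before calling torch.cuda.set_device().
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # All devices are visible, so the global id is already valid.
        return gpu_id
    ids = [int(x) for x in visible.split(",") if x.strip()]
    # The global id's position in the visible list is the local ordinal;
    # fall back to device 0 if the id is not in the visible set.
    return ids.index(gpu_id) if gpu_id in ids else 0
```

With such a helper, line 32 would become `torch.cuda.set_device(remap_gpu_id(args.gpu_id))` instead of being deleted; whether that is needed depends on how the launch script assigns CUDA_VISIBLE_DEVICES per worker.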