modelscope/FunCodec

Fail to train in multi gpu

didadida-r opened this issue · 1 comments

Hi,

I can train the codec model on a single GPU, but I cannot train it in multi-GPU mode.

The environment is:

CUDA Version: 12.2 
alias-free-torch         0.0.6
pytorch-wpe              0.0.1
torch                    1.13.1
torch-complex            0.4.3
torchaudio               0.13.1
torchvision              0.14.1

The main log is:

./run_freqcodec.sh: gpu_num: 2
stage 3: Training
log can be found at ./exp/freqcodec_mag_phase_16k_n32_600k_step_ds640/log/train.log.0

The detailed log is:

-rw-rw-r-- 1 test test   0 Jan 18 11:07 train.log.0
-rw-rw-r-- 1 test test 884 Jan 18 11:07 train.log.1

cat train.log.1
Traceback (most recent call last):
  File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/test/code/enhance/FunCodec/funcodec/bin/codec_train.py", line 32, in <module>
    torch.cuda.set_device(args.gpu_id)
  File "/home/test/python_env/anaconda3/envs/fun_codec/lib/python3.8/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Thanks for your report. This issue is caused by differences in how torch versions assign devices under DDP. You can try commenting out line 32 of funcodec/bin/codec_train.py as follows:

# torch.cuda.set_device(args.gpu_id)
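The "invalid device ordinal" error means the process asked for a GPU index that does not exist on the node (e.g. worker 1 requesting device 1 when `CUDA_VISIBLE_DEVICES` already restricts it to a single visible device). Instead of removing the call outright, a more defensive option is to derive the index from the `LOCAL_RANK` variable that torch's distributed launchers export, clamped to the number of visible devices. This is only a sketch, not FunCodec's actual code; the helper name `pick_local_device` is hypothetical, and in practice you would pass `torch.cuda.device_count()` as `visible_gpus`.

```python
import os


def pick_local_device(visible_gpus: int, default_gpu_id: int = 0) -> int:
    """Map the launcher-provided LOCAL_RANK (or a fallback gpu id) to a
    device ordinal that is valid on this node, avoiding the
    'invalid device ordinal' error.

    visible_gpus: number of GPUs this process can see
                  (e.g. torch.cuda.device_count()).
    default_gpu_id: used when LOCAL_RANK is not set, e.g. single-GPU runs.
    """
    if visible_gpus <= 0:
        raise RuntimeError("no CUDA devices visible")
    # torchrun / torch.distributed.launch set LOCAL_RANK per worker.
    local_rank = int(os.environ.get("LOCAL_RANK", default_gpu_id))
    # Clamp so the index is always a valid ordinal for this process.
    return local_rank % visible_gpus
```

The caller would then do `torch.cuda.set_device(pick_local_device(torch.cuda.device_count(), args.gpu_id))`, which behaves like the original line on a single GPU but cannot request a nonexistent ordinal under DDP.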