TsinghuaAI/CPM-1-Finetune

RuntimeError: cuda runtime error (10)

Closed this issue · 1 comment

drxmy commented

Hello, I hit this error while running finetune_lm with the command: bash scripts/language_model/finetune_lm_large_fp32.sh
Strangely, the program keeps running afterwards. I searched for the error but could not resolve it. What could be the cause? The more detailed error output is below:
```
using world size: 8 and model-parallel size: 2

using dynamic loss scaling
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "finetune_lm.py", line 315, in <module>
    main()
  File "finetune_lm.py", line 180, in main
    initialize_distributed(args)
  File "/CPM/utils.py", line 544, in initialize_distributed
    torch.cuda.set_device(device)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 281, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37
[the traceback above is printed once per failing rank; the interleaved duplicate copies are omitted]
[2021-07-19 02:42:24,618] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.183 seconds.
Prefix dict has been built successfully.
number of parameters on model parallel rank 1: 1300096000
```

drxmy commented

The server has 8 GPUs and I am using the Docker image you provided. When I created the container I restricted it to GPU 0 and GPU 1, but the world_size printed at runtime was still 8. I created a new container that uses all of the GPUs, and the RuntimeError: cuda runtime error (10) no longer occurs.
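For what it's worth, this failure pattern is consistent with the launch script spawning a process per rank for a world size of 8 while the container only exposes two device ordinals, so ranks beyond the visible GPU count hit "invalid device ordinal" in torch.cuda.set_device while ranks 0 and 1 keep running. A minimal sketch of that mismatch, assuming the launcher assigns each rank its own device index (`assign_device` is a hypothetical stand-in for the `torch.cuda.set_device(device)` call in initialize_distributed):

```python
def assign_device(local_rank: int, num_visible_gpus: int) -> int:
    """Hypothetical stand-in for torch.cuda.set_device(local_rank).

    A rank whose index is >= the number of visible devices triggers the
    same condition as CUDA runtime error (10): invalid device ordinal.
    """
    if local_rank >= num_visible_gpus:
        raise RuntimeError(
            "cuda runtime error (10) : invalid device ordinal "
            f"(rank {local_rank}, only {num_visible_gpus} GPUs visible)"
        )
    return local_rank


# World size 8 (as printed in the log) against a container exposing 2 GPUs:
# ranks 0-1 succeed, ranks 2-7 raise, matching the repeated tracebacks
# while the surviving ranks continue running.
results = []
for rank in range(8):
    try:
        results.append(("ok", assign_device(rank, num_visible_gpus=2)))
    except RuntimeError as exc:
        results.append(("fail", str(exc)))
```

If that is indeed the cause, the fix would be to make the number of processes the script launches match the number of GPUs the container can actually see, rather than the host's total.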