RUCAIBox/TextBox

[🐛BUG] Running into the following problem when using accelerate.

Closed this issue · 8 comments

Describe the bug
After editing the accelerate config, I keep getting CUDA error: invalid device ordinal. I plan to use two GPUs, but the run always fails with the error below.

How to reproduce
accelerate launch run_textbox.py --model=T5 --model_path=Langboat/mengzi-t5-base --dataset=lcsts

Log
Traceback (most recent call last):
File "run_textbox.py", line 17, in
run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={})
File "/hy-tmp/TextBox/textbox/quick_start/quick_start.py", line 20, in run_textbox
experiment = Experiment(model, dataset, config_file_list, config_dict)
File "/hy-tmp/TextBox/textbox/quick_start/experiment.py", line 46, in init
self.accelerator = Accelerator(
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/accelerate/accelerator.py", line 346, in init
self.state = AcceleratorState(
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/accelerate/state.py", line 540, in init
PartialState(cpu, **kwargs)
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/accelerate/state.py", line 136, in init
torch.cuda.set_device(self.device)
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/torch/cuda/init.py", line 350, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3982 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 3983) of binary: /usr/local/miniconda3/envs/TextBox/bin/python
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/TextBox/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/accelerate/commands/launch.py", line 914, in launch_command
multi_gpu_launcher(args)
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_textbox.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-04-21_09:06:46
host : I121b70329100801c6c
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3983)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Which pytorch and accelerate versions are you using? Please also share the contents of the config under ~/.cache/huggingface/accelerate.

accelerate is 0.18.0 and pytorch is 1.11.0+cu113.
The config contents are:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
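
For context, "invalid device ordinal" means torch.cuda.set_device() was asked for a device index that the process cannot see, so it is worth confirming that the machine really exposes as many GPUs as num_processes expects. A minimal check, independent of TextBox and using only standard PyTorch calls:

# Quick sanity check (independent of TextBox): how many GPUs does this
# environment actually expose? "invalid device ordinal" is raised when a
# process asks torch.cuda.set_device() for an index >= torch.cuda.device_count().
import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")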

You need to add --gpu_id=0,1:
accelerate launch run_textbox.py --model=T5 --model_path=Langboat/mengzi-t5-base --dataset=lcsts --gpu_id=0,1
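
For reference, the usual pattern behind an option like --gpu_id is to restrict CUDA_VISIBLE_DEVICES before CUDA is initialized, so that local ranks 0 and 1 map to valid ordinals inside each spawned process. A rough sketch of that pattern (whether TextBox implements it exactly this way is an assumption; check the source to confirm):

# Sketch of the common pattern behind a --gpu_id style option (assumption:
# TextBox restricts visible devices in a similar way; check the source to confirm).
# CUDA_VISIBLE_DEVICES must be set before the first CUDA call so that each
# launched process sees only the listed devices and ordinals 0..N-1 are valid.
import os

gpu_id = "0,1"  # hypothetical value parsed from --gpu_id
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_id)

import torch  # imported only after the environment variable is set

assert torch.cuda.device_count() == len(gpu_id.split(","))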


Thank you, the CUDA problem is solved. There is now another problem I am not sure how to handle:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3453 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 3454) of binary: /usr/bin/python3.8
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 914, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

What command produced this error? Could you provide the full output?

Sorry, I missed it earlier. When I run the code, the problem still occurs.
Command: accelerate launch run_textbox.py --model=T5 --model_path=Langboat/mengzi-t5-base --dataset=lcsts --gpu_id=0,1
Error:
Traceback (most recent call last):
File "run_textbox.py", line 17, in
run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={})
File "/hy-tmp/TextBox/textbox/quick_start/quick_start.py", line 20, in run_textbox
experiment = Experiment(model, dataset, config_file_list, config_dict)
File "/hy-tmp/TextBox/textbox/quick_start/experiment.py", line 46, in init
self.accelerator = Accelerator(
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 346, in init
self.state = AcceleratorState(
File "/usr/local/lib/python3.8/dist-packages/accelerate/state.py", line 540, in init
PartialState(cpu, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/state.py", line 136, in init
torch.cuda.set_device(self.device)
File "/usr/local/lib/python3.8/dist-packages/torch/cuda/init.py", line 313, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

We could not reproduce your problem using the same accelerate and pytorch versions. Judging from the error, it is likely caused by CUDA or the GPUs themselves; you could also try searching for how this kind of error is usually resolved.
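
One way to narrow it down (a suggestion, not something tried in this thread): launch a minimal script that does nothing but create an Accelerator. If it fails with the same "invalid device ordinal", the problem lies in the CUDA/driver setup or the accelerate config rather than in TextBox.

# check_accelerate.py -- hypothetical minimal reproduction, independent of TextBox.
# Launch with: accelerate launch check_accelerate.py
# If this also fails with "invalid device ordinal", the problem lies in the
# CUDA/driver setup or the accelerate config rather than in TextBox.
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # the original traceback fails at this point
print(
    f"process {accelerator.process_index} -> {accelerator.device}, "
    f"visible GPUs: {torch.cuda.device_count()}"
)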

Feel free to keep asking if you have further questions.