huggingface/peft

RuntimeError: CUDA error: invalid device ordinal

yumath opened this issue · 2 comments

System Info

  File "/home/xxx/anaconda3/envs/xxx/lib/python3.11/site-packages/accelerate/state.py", line 211, in __init__
    torch.cuda.set_device(self.device)
  File "/home/xxx/anaconda3/envs/xxx/lib/python3.11/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Run the script examples/sft/run_peft_deepspeed.sh

Expected behavior

The script runs to completion without the CUDA error.
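A note on the traceback above: `torch.cuda.set_device` raises "invalid device ordinal" when the requested index is not below the number of GPUs visible to the process. One common cause, though not confirmed for this report, is a launcher starting more worker processes than there are visible GPUs. The following is a minimal diagnostic sketch, not part of PEFT or Accelerate. The helper names `ordinal_is_valid` and `diagnose` are hypothetical, and it assumes the launcher exports `LOCAL_RANK` for each worker.

```python
import os


def ordinal_is_valid(local_rank, device_count):
    """True when local_rank indexes one of the visible CUDA devices."""
    return 0 <= local_rank < device_count


def diagnose(env=None, device_count=None):
    """Describe whether this process's LOCAL_RANK fits the visible GPUs."""
    env = os.environ if env is None else env
    if device_count is None:
        try:
            import torch  # optional: only needed to count real devices
            device_count = torch.cuda.device_count()
        except ImportError:
            device_count = 0
    # Assumption: the launcher exports LOCAL_RANK for each worker process.
    local_rank = int(env.get("LOCAL_RANK", "0"))
    visible = env.get("CUDA_VISIBLE_DEVICES", "<unset>")
    if ordinal_is_valid(local_rank, device_count):
        return f"ok: LOCAL_RANK={local_rank}, {device_count} visible device(s)"
    return (
        f"mismatch: LOCAL_RANK={local_rank} but only {device_count} visible "
        f"device(s); CUDA_VISIBLE_DEVICES={visible}"
    )
```

Calling `print(diagnose())` at the top of the training script, before any Accelerate state is created, shows in each worker whether its rank fits the visible devices. If it reports a mismatch, comparing the number of launched processes with the GPU count and with `CUDA_VISIBLE_DEVICES` is a reasonable next step.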

@pacman100 do you have an idea?

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.