RuntimeError: CUDA error: invalid device ordinal
yumath opened this issue · 2 comments
yumath commented
System Info
File "/home/xxx/anaconda3/envs/xxx/lib/python3.11/site-packages/accelerate/state.py", line 211, in __init__
torch.cuda.set_device(self.device)
File "/home/xxx/anaconda3/envs/xxx/lib/python3.11/site-packages/torch/cuda/__init__.py", line 350, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
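One way to narrow this down (the thread does not confirm the cause): `invalid device ordinal` is raised when the index passed to `torch.cuda.set_device` is not below the number of GPUs visible to the process. The sketch below assumes the launcher exports `LOCAL_RANK`, as `torchrun` does, and compares that rank with `torch.cuda.device_count()`. The helper names `parse_visible_devices` and `check_local_rank` are hypothetical and not part of accelerate or PyTorch.

```python
import os


def parse_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES string into entries; None means it is unset."""
    if value is None:
        return None
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def check_local_rank(local_rank, device_count):
    """Return None if local_rank is a valid device ordinal, else a diagnostic message."""
    if 0 <= local_rank < device_count:
        return None
    return (
        f"local rank {local_rank} has no matching GPU: "
        f"only {device_count} device(s) are visible to this process"
    )


def main():
    try:
        import torch
        device_count = torch.cuda.device_count()
    except ImportError:
        device_count = 0  # torch is not installed here, so nothing can be queried

    # Assumption: the launcher sets LOCAL_RANK; default to 0 for a plain run.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    visible = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))

    print(f"CUDA_VISIBLE_DEVICES entries: {visible}")
    print(f"visible GPU count: {device_count}, LOCAL_RANK: {local_rank}")
    problem = check_local_rank(local_rank, device_count)
    print(problem or "LOCAL_RANK maps to a visible device")


if __name__ == "__main__":
    main()
```

If the reported rank is not below the visible GPU count, the launch configuration may be requesting more processes than there are GPUs, or `CUDA_VISIBLE_DEVICES` may be hiding some of them.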
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder
- My own task or dataset (give details below)
Reproduction
Run the script examples/sft/run_peft_deepspeed.sh
Expected behavior
The script runs successfully.
BenjaminBossan commented
@pacman100 do you have an idea?
github-actions commented
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.