artidoro/qlora

RuntimeError: CUDA error: an illegal memory access was encountered

flaviadeutsch opened this issue · 6 comments

QLoRA with LLaMA 13B:

  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 270, in step
    torch.cuda.synchronize()
  File "/home/hysz/anaconda3/envs/qlora/lib/python3.10/site-packages/torch/cuda/__init__.py", line 688, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
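The error message above suggests `CUDA_LAUNCH_BLOCKING=1` because CUDA errors are reported asynchronously, so the `torch.cuda.synchronize()` frame in the traceback is probably not where the fault happened. A minimal sketch of setting the variable from Python follows. The helper name `enable_blocking_cuda_launches` is hypothetical, and setting the variable before `import torch` is a conservative assumption rather than a documented requirement:

```python
import os


def enable_blocking_cuda_launches() -> None:
    """Ask CUDA to run kernels synchronously so errors surface at the failing call."""
    # Assumption: this runs before CUDA is initialized; doing it before
    # `import torch` is the conservative choice.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_cuda_launches()
# import torch  # import and start training only after the variable is set
```

Equivalently, the variable can be set in the shell when launching the training script. Expect training to run slower while it is enabled, so it is best used only to get a trustworthy traceback.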

I got this as well today.

What's your torch version? I used torch 2.0 at first and got the same problem; after I downgraded to 1.13.1 it worked well. Hope this helps.

I also got this when using decapoda-research/llama-7b-hf. With another hf conversion (more recent I think) I did not get the problem. I recommend using newer conversions if possible.

It looks like it can also be fixed by downgrading torch, but I haven't verified it.

Which HF conversion, please?

Same error with the latest (nightly) torch 2.0. However, --per_device_train_batch_size 2 --gradient_accumulation_steps 1 works; with --per_device_train_batch_size set to 3, an illegal memory access was encountered.
#82

╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
1%|▋ | 62/10000 [05:00<13:21:35, 4.84s/it]
root@8d5798ac79ee:/wzh/qlora# pip list|grep torch
pytorch-triton 2.1.0+440fd1bf20
torch 1.13.1
torchaudio 2.1.0.dev20230622+cu121
torchsparseattn 0.2
torchvision 0.16.0.dev20230622+cu121

The error also occurs with torch 1.13.1.
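The `pip list` output above shows torch 1.13.1 next to torchvision and torchaudio development builds for cu121, which may mean the downgrade left mismatched packages in the environment. As a rough sketch, the installed versions can be collected in one place with the standard library before comparing runs. The package list and the helper name `installed_versions` are illustrative assumptions, not something taken from the qlora repository:

```python
from importlib import metadata

# Packages worth checking together; adjust the list to your environment.
PACKAGES = ["torch", "torchvision", "torchaudio", "pytorch-triton", "bitsandbytes"]


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output alongside the traceback makes it easier to tell whether two reports of the same error come from comparable environments.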