lm-sys/FastChat

ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object


dinchu commented

When trying to load quantized models, I always get:

ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object

Hi, may I ask how you load the model? In my case, with a single GPU, I also had this problem and had to set disable_exllama=True while loading the model (or change the config.json in your model folder and add disable_exllama: true under quantization_config if you're loading it directly from files). When I worked with 2 GPUs I did not have this problem. Sorry if this does not answer your question, but I hope it helps; sadly, I do not know why it happens.
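For reference, a minimal sketch of the loading route described above, assuming the model is loaded through transformers' from_pretrained with a GPTQConfig. The model id below is only a placeholder, and newer transformers releases renamed the flag to use_exllama=False:

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Sketch: turn off the Exllama kernels at load time instead of editing config.json.
# The model id is a placeholder; substitute the GPTQ checkpoint you actually load.
gptq_config = GPTQConfig(bits=4, disable_exllama=True)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization_config=gptq_config,
    device_map="auto",
)
```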

That works, thanks. @aliozts


Disabling Exllama makes inference much slower, though.

Check out AutoGPTQ/AutoGPTQ#406 for how to enable Exllama.
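If you would rather keep Exllama enabled, the error message itself points at the root cause: some modules were offloaded to cpu/disk. A hedged sketch, assuming the quantized model fits entirely in a single GPU's memory (the model id is again a placeholder):

```python
from transformers import AutoModelForCausalLM

# Sketch: pin every module to GPU 0 so nothing lands on cpu/disk and the
# Exllama backend can stay enabled. Only works if the model fits in VRAM.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    device_map={"": 0},
)
```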

Why can't I run it on the GPU? I have an NVIDIA GeForce MX450.
Could anyone please help?

I have an NVIDIA GTX 1650 and am still getting the same error.

Adding "disable_exllama": true under quantization_config in config.json solves the problem.
This error only appears with a single GPU; it never showed up with multiple GPUs. The GPU used was a Tesla T4.
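For illustration, the edited quantization_config block in config.json might look roughly like this; only the "disable_exllama" line is the actual change, the other fields are examples and vary per checkpoint:

```json
{
  "quantization_config": {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "disable_exllama": true
  }
}
```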

I was running into a similar problem running GPTQ in a Docker container and getting the disable_exllama error. In short, the issue showed up when I ran the container without the --gpus all flag. Below is my system config:

GPU: 1660Ti
transformers==4.36.2
optimum==1.16.1
auto-gptq==0.6.0+cu118
CUDA=12.3

SOLUTION: I fixed the disable_exllama error by running the container with --gpus all.
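For example, a sketch of the invocation; the image name and model path are placeholders, and only the --gpus all flag is the actual fix:

```bash
# Sketch: expose the host GPUs to the container so the quantized model can stay on GPU.
docker run --gpus all my-fastchat-image \
    python3 -m fastchat.serve.cli --model-path /models/my-gptq-model
```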

I am also facing the same issue. Disabling exllama slows inference down a lot, so I am not sure that's the ideal way.

Here are more details - #3530