lm-sys/FastChat

ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object


dinchu commented

When trying to load quantized models, I always get:

ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object

Hi, may I ask how you load the model? In my case, with a single GPU, I also had this problem and had to set disable_exllama=True while loading the model (or change the config.json in your model folder and add disable_exllama: true under quantization_config if you're loading it directly from files). When I worked with 2 GPUs I did not have this problem. Sorry if this does not answer your question, but I hope it helps; sadly, I do not know why it happens.
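For reference, a minimal sketch of the loading route described above, assuming the model is loaded through transformers' from_pretrained with a GPTQConfig. The model id below is only a placeholder, and newer transformers releases renamed the flag to use_exllama=False:

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Sketch: turn off the Exllama kernels at load time instead of editing config.json.
# The model id is a placeholder; substitute the GPTQ checkpoint you actually load.
gptq_config = GPTQConfig(bits=4, disable_exllama=True)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization_config=gptq_config,
    device_map="auto",
)
```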

That works, thanks. @aliozts


Disabling Exllama makes inference much slower, though.

Check out AutoGPTQ/AutoGPTQ#406 for how to enable Exllama.
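If you would rather keep Exllama enabled, the error message itself points at the root cause: some modules were offloaded to cpu/disk. A hedged sketch, assuming the quantized model fits entirely in a single GPU's memory (the model id is again a placeholder):

```python
from transformers import AutoModelForCausalLM

# Sketch: pin every module to GPU 0 so nothing lands on cpu/disk and the
# Exllama backend can stay enabled. Only works if the model fits in VRAM.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",
    device_map={"": 0},
)
```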

Why can't I run it on the GPU? I have an NVIDIA GeForce MX450.
Could anyone please help?

I have an NVIDIA GTX 1650 and am still getting the same error.

Adding "disable_exllama": true under quantization_config in config.json solves the problem.
This error only appears with a single GPU; it never showed up with multiple GPUs. The GPU used was a Tesla T4.
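For illustration, the edited quantization_config block in config.json might look roughly like this; only the "disable_exllama" line is the actual change, the other fields are examples and vary per checkpoint:

```json
{
  "quantization_config": {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "disable_exllama": true
  }
}
```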

I was running into a similar problem running GPTQ in a Docker container and getting the disable_exllama error. In short, the issue showed up when I ran the container without the --gpus all flag. Below is my system config:

GPU: 1660Ti
transformers==4.36.2
optimum==1.16.1
auto-gptq==0.6.0+cu118
CUDA=12.3

SOLUTION: I fixed the disable_exllama error by running the container with --gpus all.
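For example, a sketch of the invocation; the image name and model path are placeholders, and only the --gpus all flag is the actual fix:

```bash
# Sketch: expose the host GPUs to the container so the quantized model can stay on GPU.
docker run --gpus all my-fastchat-image \
    python3 -m fastchat.serve.cli --model-path /models/my-gptq-model
```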

I am also facing the same issue. Disabling exllama slows inference down a lot, so I am not sure that's the ideal way.

Here are more details - #3530