ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU. You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
Opened this issue · 8 comments
When trying to load quantized models, I always get:

ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU. You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
Hi, may I ask how you load the model? In my case, with a single GPU, I also had this problem and had to use `disable_exllama=True` while loading the model (change the `config.json` in your model folder and add `disable_exllama: true` under `quantization_config` there if you're loading it directly from a file). When I worked with 2 GPUs, I did not have this problem. Sorry if this doesn't answer your question, but I hope it helps. Sadly, I don't know why this happens.
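For reference, the relevant part of `config.json` would then look something like the fragment below. The exact fields under `quantization_config` depend on how the model was quantized; the `bits` and `group_size` values here are only illustrative, and the key addition is the `disable_exllama` line:

```json
{
  "quantization_config": {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "disable_exllama": true
  }
}
```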
It worked, thanks @aliozts
Disabling Exllama makes inference much slower.
Check out AutoGPTQ/AutoGPTQ#406 for how to enable Exllama.
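The error itself is raised when the device map assigns any quantized module to `cpu` or `disk` (typically because the GPU ran out of memory and accelerate offloaded some layers). A minimal sketch of that check, using a plain dict as a stand-in for the real device map (module names below are hypothetical examples):

```python
def find_offloaded_modules(device_map):
    """Return names of modules placed on cpu or disk.

    Any non-empty result is what triggers the "Found modules on
    cpu/disk" ValueError when the exllama backend is enabled.
    """
    return [name for name, device in device_map.items()
            if device in ("cpu", "disk")]

# Example device map where the GPU ran out of room and two
# modules were offloaded:
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": "cpu",
    "lm_head": "disk",
}
print(find_offloaded_modules(device_map))  # ['model.layers.1', 'lm_head']
```

If this kind of check on your loaded model's device map (e.g. `model.hf_device_map` in transformers) shows anything on `cpu` or `disk`, exllama cannot be used until everything fits on the GPU.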
Why can't I run it on the GPU? I have an NVIDIA GeForce MX450.
Could anyone please help?
I have an NVIDIA GTX 1650 and am still getting the same error.
Adding `"disable_exllama": true` under `quantization_config` in `config.json` solves the problem.
This error only appears with a single GPU; it never occurred with multiple GPUs. The GPU I used was a Tesla T4.
I was running into a similar problem running GPTQ in a Docker container: I was getting the `disable_exllama`
error. In short, the issue showed up when I ran the container without the `--gpus all`
flag. Below is my system config:
GPU: 1660Ti
transformers==4.36.2
optimum==1.16.1
auto-gptq==0.6.0+cu118
CUDA=12.3
SOLUTION: I fixed the `disable_exllama`
error by running the container with `--gpus all`.
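This makes sense given the earlier comments: without GPU access, every module ends up on CPU, which is exactly what the exllama check rejects. A minimal invocation (image and script names are hypothetical placeholders) would look like:

```shell
# Without --gpus all, the container sees no GPU, so accelerate
# offloads all modules to CPU and the exllama check fails.
docker run --gpus all -it my-gptq-image:latest python run_inference.py
```

This requires the NVIDIA Container Toolkit to be installed on the host.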
I am also facing the same issue. Disabling exllama slows down inference a lot, so I'm not sure that's the ideal fix.
Here are more details - #3530