0cc4m/KoboldAI

Can't split 4-bit model between GPU/CPU, and can't run on CPU only

tdtrumble opened this issue · 1 comment

I can load 4-bit GPTQ models up to 30B/33B entirely on my GPU (4090) just fine. However, when I try to load a 60B model solely on the CPU (both sliders in the load dialog set to 0), I get "IndexError: list index out of range" and it won't even attempt to load. I have 128GB of RAM. When I try to split between CPU and GPU instead (GPU preload set to any value; I tried 55 and 10 with the same result), the model loads, but inference fails with "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!". What am I missing? Do 4-bit GPTQ models only run on the GPU?

Yes, GPTQ models only live on the GPU; they can't be run on or split with the CPU.
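
For context, the second error is ordinary PyTorch behavior rather than anything KoboldAI-specific: the CUDA kernels behind the quantized layers expect every tensor on the GPU, so any part of the model or its activations left on the CPU breaks the forward pass. A minimal sketch of the same failure, using a plain `torch.nn.Linear` as a stand-in for a GPTQ layer (not actual KoboldAI code):

```python
import torch

# A layer placed on the GPU, standing in for a GPTQ-quantized layer.
layer = torch.nn.Linear(8, 8).cuda()

# Activations left on the CPU, as happens when part of the model
# is offloaded there.
x = torch.randn(1, 8)

try:
    layer(x)
except RuntimeError as e:
    # Prints: "Expected all tensors to be on the same device,
    # but found at least two devices, cuda:0 and cpu!"
    print(e)
```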