0cc4m/KoboldAI

Slow speed for some models.

BadisG opened this issue · 4 comments

BadisG commented

Hey,

I tried using this fork and I realized that the speed was really slow for some models that I was using
https://huggingface.co/reeducator/vicuna-13b-cocktail/tree/main

For vicuna-cocktail, for example, I get something like 2 tokens/s, even though I easily reach 10 tokens/s on ooba's webui.

Some other models (like raw LLaMA 13B) give me 7 tokens/s, which is fine.

I guess this has to do with vicuna-cocktail not having been saved with the "save_pretrained" option? I don't know, just trying to guess here.

Anyway, if you could look into that and get "normal" speed in every situation, that would be cool.

Thanks in advance.

0cc4m commented

When loading a model, it tells you the quantization version. Versions 0 and 2 are slow: 0 because it is old, 2 because upstream GPTQ prefers accuracy over speed. If you want fast models, use version 1. They usually show up on Huggingface as compatible with KoboldAI.
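
If you want to check a checkpoint without loading it in KoboldAI, something like the sketch below might help. This is a rough heuristic of my own, not the fork's actual detection code: as far as I know, act-order checkpoints carry per-layer `g_idx` tensors on top of the older `qweight`/`qzeros`/`scales` layout, so their presence hints at the slower variant. The filename is just an example.

```python
# Rough heuristic, not KoboldAI's actual version detection: act-order
# checkpoints add per-layer g_idx tensors on top of the older
# qweight/qzeros/scales layout.
from safetensors import safe_open

def guess_gptq_format(path: str) -> str:
    # Open the checkpoint lazily and only read the tensor names.
    with safe_open(path, framework="pt") as f:
        keys = list(f.keys())
    if not any(k.endswith("qweight") for k in keys):
        return "no quantized tensors found (not a GPTQ checkpoint?)"
    if any(k.endswith("g_idx") for k in keys):
        return "act-order checkpoint (the slower kind)"
    return "older layout without g_idx (the faster kind)"

# Example path -- substitute your own checkpoint.
print(guess_gptq_format("vicuna-13b-cocktail-4bit-128g.safetensors"))
```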

BadisG commented

> When loading a model, it tells you the quantization version.

Oh yeah, I have version 2.


But still, even with those "slow" models I can get 10 tokens/s on ooba's webui, so there must be a way to get the same speed in KoboldAI.

BadisG commented

> But still, even with those "slow" models I can get 10 tokens/s on ooba's webui, so there must be a way to get the same speed in KoboldAI.

If you can't achieve that, then I have 2 questions:

  1. How do you make a "Version 1" GPTQ when you decide to quantize a model? (see the sketch after this list)
  2. Do you lose a lot of accuracy when using version 1?
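
For what it's worth, here's a sketch of question 1 using AutoGPTQ; I'm assuming that's an acceptable stand-in for whatever quantization tooling this fork expects, and `desc_act` is AutoGPTQ's name for act-order. My understanding is that `desc_act=False` gives you the fast format 0cc4m calls version 1, and `desc_act=True` the slower, slightly more accurate version 2, so the accuracy cost of version 1 is usually small, but measure it yourself.

```python
# Illustrative sketch with AutoGPTQ, not necessarily this fork's tooling.
# desc_act=False skips activation-order quantization (fast "version 1" style);
# desc_act=True trades speed for a little accuracy ("version 2" style).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "reeducator/vicuna-13b-cocktail"  # fp16 source weights
quantized_model_dir = "vicuna-13b-cocktail-4bit-128g"

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # common groupsize choice
    desc_act=False,  # no act-order -> the faster format
)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
# A real run wants a proper calibration set; one toy sample is just for shape.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir, use_safetensors=True)
```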

I've tried a few models and am seeing the same: 2 tk/s with this version of KoboldAI (same speed as standard) and 10-12 tk/s with oobabooga on the same models using exllama.