Slow speed for some models.
BadisG opened this issue · 4 comments
Hey,
I tried this fork and noticed that generation is really slow for some of the models I use, for example:
https://huggingface.co/reeducator/vicuna-13b-cocktail/tree/main
With vicuna-cocktail, for example, I get about 2 tokens/s, even though I easily reach 10 tokens/s on ooba's webui.
Some other models (like raw llama 13b) give me 7 tokens/s, which is fine.
I guess this has to do with vicuna-cocktail not having been saved with `save_pretrained`? I don't know, just guessing.
Anyway, if you could look at that and try to get "normal" speed in every situation, that would be cool.
Thanks in advance.
When loading a model, it tells you the quantization version. Versions 0 and 2 are slow: 0 because it is the old format, and 2 because upstream GPTQ prefers accuracy over speed. If you want fast models, use version 1. They usually show up on Hugging Face as compatible with KoboldAI.
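For anyone unsure which version a local checkpoint is, here is a rough, hypothetical sketch of how one might guess it from the tensor names in a `.pt` file. The key suffixes and the version mapping are assumptions based on common GPTQ-for-LLaMa layouts, not the fork's actual detection code:

```python
# Hypothetical sketch: guessing a GPTQ checkpoint's layout from its tensor names.
# The suffixes below (g_idx / qzeros / zeros) are assumptions about typical
# GPTQ-for-LLaMa checkpoints, not the loader's real version-detection logic.
import torch

def guess_gptq_version(checkpoint_path: str) -> int:
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    keys = state_dict.keys()
    if any(k.endswith(".g_idx") for k in keys):
        return 2  # act-order checkpoints carry a g_idx tensor (accurate but slow)
    if any(k.endswith(".qzeros") for k in keys):
        return 1  # packed integer zero-points, no act-order (the fast layout)
    return 0      # oldest layout with float "zeros" tensors (slow)

print(guess_gptq_version("models/vicuna-13b-cocktail-4bit-128g.pt"))
```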
But still, even with those "slow" models I can get 10 tokens/s on ooba's webui, so there must be a way to get the same speed in KoboldAI.
If you can't achieve that, then I have two questions:
- How do you produce a "version 1" file when you decide to quantize a model? (A hedged sketch follows this list.)
- Do you lose a lot of accuracy when using version 1?
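On the first question, a hedged sketch of how a version-1-style file is typically produced with the qwopqwop200 GPTQ-for-LLaMa scripts: quantize without `--act-order` (adding `--act-order` is, by this thread's numbering, what yields the slower, more accurate version-2 layout). The model path, calibration dataset, and output name below are placeholders:

```python
# Hypothetical sketch: invoking GPTQ-for-LLaMa's quantization script.
# Script name and flags follow the qwopqwop200 repo's README; the paths are
# placeholders. Omitting --act-order should give the faster version-1 layout.
import subprocess

subprocess.run([
    "python", "llama.py", "models/llama-13b", "c4",
    "--wbits", "4",
    "--groupsize", "128",
    "--save", "llama-13b-4bit-128g.pt",  # no --act-order -> version-1-style file
], check=True)
```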
I've tried a few models and am seeing the same: 2 tk/s with this version of KoboldAI (same speed as standard) and 10-12 tk/s with oobabooga on the same models using exllama.
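For comparing backends on equal footing, a minimal timing sketch using plain Transformers; the model directory and prompt are placeholders, not from this thread:

```python
# Minimal tokens/s measurement, assuming an HF-format model at MODEL_DIR.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "models/vicuna-13b-cocktail"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```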