Slow speed for some models.
BadisG opened this issue · 4 comments
Hey,
I tried this fork and noticed that generation is really slow for some of the models I use, for example:
https://huggingface.co/reeducator/vicuna-13b-cocktail/tree/main
With vicuna-cocktail, for example, I get about 2 tokens/s, even though I easily reach 10 tokens/s on ooba's webui.
Some other models (like raw llama 13b) give me 7 tokens/s, which is fine.
I guess this has to do with vicuna-cocktail not having been saved with `save_pretrained`? I don't know, just guessing.
Anyway, if you could look at that and try to get "normal" speed in every situation, that would be cool.
Thanks in advance.
When loading a model, it tells you the quantization version. Versions 0 and 2 are slow: 0 because it is the old format, and 2 because upstream GPTQ prefers accuracy over speed. If you want fast models, use version 1. They usually show up on Hugging Face as compatible with KoboldAI.
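For anyone unsure which version a local checkpoint is, here is a rough, hypothetical sketch of how one might guess it from the tensor names in a `.pt` file. The key suffixes and the version mapping are assumptions based on common GPTQ-for-LLaMa layouts, not the fork's actual detection code:

```python
# Hypothetical sketch: guessing a GPTQ checkpoint's layout from its tensor names.
# The suffixes below (g_idx / qzeros / zeros) are assumptions about typical
# GPTQ-for-LLaMa checkpoints, not the loader's real version-detection logic.
import torch

def guess_gptq_version(checkpoint_path: str) -> int:
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    keys = state_dict.keys()
    if any(k.endswith(".g_idx") for k in keys):
        return 2  # act-order checkpoints carry a g_idx tensor (accurate but slow)
    if any(k.endswith(".qzeros") for k in keys):
        return 1  # packed integer zero-points, no act-order (the fast layout)
    return 0      # oldest layout with float "zeros" tensors (slow)

print(guess_gptq_version("models/vicuna-13b-cocktail-4bit-128g.pt"))
```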
But still, even with those "slow" models I can get 10 tokens/s on ooba's webui, so there must be a way to get the same speed in KoboldAI.
If you can't achieve that, then I have two questions:
- How do you produce a "version 1" file when you decide to quantize a model? (A hedged sketch follows this list.)
- Do you lose a lot of accuracy when using version 1?
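On the first question, a hedged sketch of how a version-1-style file is typically produced with the qwopqwop200 GPTQ-for-LLaMa scripts: quantize without `--act-order` (adding `--act-order` is, by this thread's numbering, what yields the slower, more accurate version-2 layout). The model path, calibration dataset, and output name below are placeholders:

```python
# Hypothetical sketch: invoking GPTQ-for-LLaMa's quantization script.
# Script name and flags follow the qwopqwop200 repo's README; the paths are
# placeholders. Omitting --act-order should give the faster version-1 layout.
import subprocess

subprocess.run([
    "python", "llama.py", "models/llama-13b", "c4",
    "--wbits", "4",
    "--groupsize", "128",
    "--save", "llama-13b-4bit-128g.pt",  # no --act-order -> version-1-style file
], check=True)
```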
I've tried a few models and am seeing the same: 2 tk/s with this version of KoboldAI (same speed as standard) and 10-12 tk/s with oobabooga on the same models using exllama.
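For comparing backends on equal footing, a minimal timing sketch using plain Transformers; the model directory and prompt are placeholders, not from this thread:

```python
# Minimal tokens/s measurement, assuming an HF-format model at MODEL_DIR.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "models/vicuna-13b-cocktail"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```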