PygmalionAI/aphrodite-engine

[Feature]: Exllamav2 Q4, Q6, and Q8 cache

Opened this issue · 3 comments

🚀 The feature, motivation and pitch

I only found a discussion asking about this, but from the published evaluation it seems Q4 is now better than FP8 and close to (almost equal to) the FP16 cache. I don't use this engine myself and am just looking in from the outside, but I believe this may benefit users who are trying to squeeze in a bit more context without reducing overall accuracy by much.

Additional context

Here's the evaluation between the different cache types: turboderp/exllamav2/doc/qcache_eval.md
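For intuition only (this is not exllamav2's or aphrodite-engine's actual implementation), a minimal sketch of why a Q4 cache can rival FP8: quantizing the KV cache in small groups with one scale per group preserves local dynamic range, so 4 bits plus a per-group scale can track the tensor's values closely. The function names and `group_size` below are hypothetical.

```python
import numpy as np

def quantize_groups(x, bits=4, group_size=32):
    # Symmetric group-wise quantization: one float scale per group of values.
    x = x.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0               # avoid division by zero for all-zero groups
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize_groups(q, scale):
    # Reconstruct approximate values from the quantized codes and scales.
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
cache = rng.standard_normal(1024).astype(np.float32)  # stand-in for KV cache values
q, s = quantize_groups(cache, bits=4)
recon = dequantize_groups(q, s).astype(np.float32)
err = np.abs(cache - recon).mean()
```

Real implementations store the 4-bit codes packed two per byte and run the (de)quantization in fused kernels; the sketch above only shows the numerics of the round trip.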

It's definitely a planned feature. I believe @sgsdxzy wanted to work on it.

Alright, feel free to close this issue once that's done.

Also, an update on this: FP8 cache may be removed from exllamav2 sometime in the future, and Q8 and Q6 cache are now in the master branch.