PygmalionAI/aphrodite-engine

[Feature]: Exllamav2 Q4, Q6, and Q8 cache

Opened this issue · 3 comments

🚀 The feature, motivation and pitch

I only found a discussion asking about this, but from the published evaluation it seems Q4 is now better than FP8 and close to (almost equal to) the FP16 cache. I don't use this engine myself and am just looking in from the outside, but I believe this may benefit users who are trying to squeeze in a bit more context without reducing overall accuracy by much.

Additional context

Here's the evaluation between the different cache types: turboderp/exllamav2/doc/qcache_eval.md
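For intuition only (this is not exllamav2's or aphrodite-engine's actual implementation), a minimal sketch of why a Q4 cache can rival FP8: quantizing the KV cache in small groups with one scale per group preserves local dynamic range, so 4 bits plus a per-group scale can track the tensor's values closely. The function names and `group_size` below are hypothetical.

```python
import numpy as np

def quantize_groups(x, bits=4, group_size=32):
    # Symmetric group-wise quantization: one float scale per group of values.
    x = x.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0               # avoid division by zero for all-zero groups
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize_groups(q, scale):
    # Reconstruct approximate values from the quantized codes and scales.
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
cache = rng.standard_normal(1024).astype(np.float32)  # stand-in for KV cache values
q, s = quantize_groups(cache, bits=4)
recon = dequantize_groups(q, s).astype(np.float32)
err = np.abs(cache - recon).mean()
```

Real implementations store the 4-bit codes packed two per byte and run the (de)quantization in fused kernels; the sketch above only shows the numerics of the round trip.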

It's definitely a planned feature. I believe @sgsdxzy wanted to work on it.

Alright, feel free to close this issue once that's done.

Also, an update on this: FP8 cache may be removed from exllamav2 sometime in the future, and Q8 and Q6 cache are now in the master branch.