[Feature]: Exllamav2 Q4, Q6, and Q8 cache
Opened this issue · 3 comments
Anthonyg5005 commented
🚀 The feature, motivation and pitch
I only found a discussion asking about this, but from the evaluation it looks like Q4 cache now outperforms FP8 and is close to (almost equal to) FP16 cache. I don't use this engine personally and am just looking in from the outside, but I believe this may benefit some of its users who are trying to squeeze in a bit more context without reducing overall accuracy by much.
Additional context
Here's the evaluation between the different cache types: turboderp/exllamav2/doc/qcache_eval.md
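For context on what "Q4 cache" means here: the idea is to store KV-cache values as 4-bit integers with a per-group scale and offset instead of keeping them in FP16. The sketch below is illustrative only and is not exllamav2's actual kernel (its grouping, packing, and scale format differ); it just shows the basic blockwise 4-bit round trip.

```python
import numpy as np

def quantize_q4(x: np.ndarray, group_size: int = 32):
    """Blockwise 4-bit quantization: each group of `group_size` values
    shares one scale/offset pair, and values become 4-bit codes (0..15)."""
    x = x.reshape(-1, group_size)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale[scale == 0] = 1.0  # avoid division by zero for constant groups
    q = np.round((x - lo) / scale).astype(np.uint8)  # 4-bit codes
    return q, scale, lo

def dequantize_q4(q: np.ndarray, scale: np.ndarray, lo: np.ndarray):
    """Reconstruct approximate values from codes, scales, and offsets."""
    return q.astype(np.float32) * scale + lo

# Round-trip a fake KV tensor; worst-case error per value is scale / 2.
rng = np.random.default_rng(0)
kv = rng.standard_normal(4096).astype(np.float32)
codes, scales, offsets = quantize_q4(kv)
kv_hat = dequantize_q4(codes, scales, offsets).reshape(-1)
max_err = np.abs(kv - kv_hat).max()
```

In a real implementation two 4-bit codes would be packed per byte, halving memory versus FP8 and quartering it versus FP16, which is where the extra context headroom comes from.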
Anthonyg5005 commented
Alright, feel free to close this issue when that's done.
Anthonyg5005 commented
Also, an update on this: FP8 cache may be removed from exllamav2 sometime in the future, and Q8 and Q6 cache are now in the master branch.