KV cache / post-RoPE rotation & quantization in QuaRot
Hello,
First of all, thank you for your effort in creating and sharing this useful repo!
I'm looking into the code of the QuaRot implementation.
My apologies in advance if I'm missing something, but I could not find where your code implements rotation and quantization of the KV cache (in particular, of the post-RoPE K values).
Do you implement this functionality? If so, could you please point out where it is implemented?
Thanks in advance!
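For readers unfamiliar with the feature being asked about, here is roughly the idea as the QuaRot paper describes it. RoPE depends on the token position, so a fixed rotation cannot be folded into the key projection ahead of it. Instead, an orthogonal Hadamard transform is applied online to both queries and keys after RoPE. Because the transform is orthogonal, the attention logits are unchanged, while the outlier channel is spread across the head dimension before the key is quantized into the cache. The following is a minimal pure-Python sketch for a single head, not LLMC or QuaRot code. The helper names (`rope`, `hadamard`, `quantize_asym`), the interleaved RoPE layout, and the 4-bit asymmetric per-vector quantizer are illustrative assumptions.

```python
import math

def rope(x, pos, base=10000.0):
    """Interleaved RoPE (assumed layout): rotate each pair (x[i], x[i+1]) by a position-dependent angle."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform (H / sqrt(n)); n must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "head_dim must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                y[j], y[j + h] = y[j] + y[j + h], y[j] - y[j + h]
        h *= 2
    return [v / math.sqrt(n) for v in y]

def quantize_asym(x, bits):
    """Hypothetical asymmetric min-max quantizer for one vector (one scale and zero point)."""
    qmax = 2 ** bits - 1
    lo, hi = min(x), max(x)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)
    codes = [min(max(round(v / scale) + zero, 0), qmax) for v in x]
    return codes, scale, zero

def dequantize(codes, scale, zero):
    return [(c - zero) * scale for c in codes]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# One attention head, head_dim = 8; channel 3 of the key is an outlier.
q = [0.5, -1.0, 0.3, 0.8, -0.2, 0.4, 0.1, -0.6]
k = [0.3, -0.2, 0.1, 8.0, -0.3, 0.2, 0.0, -0.1]

# RoPE first (position-dependent), then the online rotation on both Q and K.
q_rope, k_rope = rope(q, pos=7), rope(k, pos=5)
q_rot, k_rot = hadamard(q_rope), hadamard(k_rope)

# H is orthogonal, so rotating both sides leaves the attention logit unchanged.
score_fp = dot(q_rope, k_rope)
score_rot = dot(q_rot, k_rot)

# The cache would store the quantized rotated key; compare with quantizing the unrotated key.
codes, scale, zero = quantize_asym(k_rot, bits=4)
score_q_rot = dot(q_rot, dequantize(codes, scale, zero))
codes_p, scale_p, zero_p = quantize_asym(k_rope, bits=4)
score_q_plain = dot(q_rope, dequantize(codes_p, scale_p, zero_p))

print(f"fp: {score_fp:.4f}  rotated fp: {score_rot:.4f}")
print(f"int4 K, no rotation: {score_q_plain:.4f}  int4 K, rotated: {score_q_rot:.4f}")
print(f"max |k| before rotation: {max(map(abs, k_rope)):.3f}, after: {max(map(abs, k_rot)):.3f}")
```

The values-side rotation is left out here. In QuaRot it can be fused into the value and output projection weights, so only the key side needs the online transform shown above.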
No, we do not implement KV cache quantization.
@Harahan, thank you for your response.
Would you like to consider this as a feature request?
I see that you are also in the process of implementing SpinQuant (which is great!),
and I think that neither QuaRot nor SpinQuant support can be complete without this feature.
Sorry, we don't plan to add this feature. You can implement it yourself.
@Harahan,
Sorry for coming back to this point,
but the LLMC paper (https://arxiv.org/abs/2405.06001) explicitly mentions evaluations of KV cache quantization (Appendix A.4).
How can these results be reproduced?
:)
@sasha-hailo For that section, we ran the benchmarks directly with LightLLM (simulated quantization at 2-bit, and real quantization at 4-bit and 8-bit).
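The distinction between simulated and real quantization in that reply can be illustrated as follows. Simulated (fake) quantization quantizes and immediately dequantizes, so tensors stay in floating point and accuracy at very low bit-widths such as 2-bit can be measured without dedicated kernels. Real quantization stores the integer codes themselves, for example two 4-bit codes packed per byte. This is a minimal sketch assuming asymmetric per-vector quantization. The helper names (`fake_quant`, `pack_int4`, `unpack_int4`) are hypothetical, and this is not how LightLLM implements either mode.

```python
def fake_quant(x, bits):
    """Simulated quantization: quantize then dequantize at once; output is still float."""
    qmax = 2 ** bits - 1
    lo, hi = min(x), max(x)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)
    return [(min(max(round(v / scale) + zero, 0), qmax) - zero) * scale for v in x]

def pack_int4(codes):
    """Real 4-bit storage: two unsigned 4-bit codes per byte (low nibble first)."""
    assert len(codes) % 2 == 0, "need an even number of codes"
    out = bytearray()
    for lo_nib, hi_nib in zip(codes[0::2], codes[1::2]):
        assert 0 <= lo_nib < 16 and 0 <= hi_nib < 16
        out.append(lo_nib | (hi_nib << 4))
    return bytes(out)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the 4-bit codes in their original order."""
    codes = []
    for b in packed:
        codes.extend((b & 0x0F, b >> 4))
    return codes

k = [0.3, -0.2, 0.1, 8.0, -0.3, 0.2, 0.0, -0.1]
k_2bit_sim = fake_quant(k, bits=2)       # floats restricted to at most 4 distinct levels
codes = [3, 0, 15, 7, 1, 12, 9, 4]
packed = pack_int4(codes)                # 8 codes occupy 4 bytes
print(k_2bit_sim, packed.hex(), unpack_int4(packed))
```

Simulated quantization captures the accuracy effect of the rounding, but not the memory savings or the kernel speed. Those only show up with real packed storage like the 4-bit and 8-bit cases.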