SqueezeAILab/KVQuant
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Issues
- How to reproduce Table 19 (KVQuant vs. KIVI) (#14, opened by condy0919, 2 comments)
- Coupled Channel-wise Quantization (#12, opened by naston, 0 comments; see the per-channel vs. per-token sketch after this list)
- Would the current implementation of Fisher Information work out of the box with Multi-head Latent Attention? (#11, opened by naston, 1 comment)
- Pre-RoPE quantization during inference (#1, opened by minghaoBD, 1 comment; see the pre-RoPE sketch after this list)
- Question about storage (#8, opened by mlxht990720, 0 comments)
- Problem when reproducing experiments (#5, opened by cat538, 1 comment)
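Several of the issues above (notably #12 on coupled channel-wise quantization) touch on the axis along which the KV cache is quantized. The KVQuant paper quantizes Keys per-channel, since Key outliers concentrate in a few channels, and Values per-token. Below is a minimal fake-quantization sketch of that asymmetry, assuming simple uniform min-max quantization; the `quantize` helper and its defaults are illustrative, not the repository's implementation.

```python
import torch

def quantize(x: torch.Tensor, dim: int, n_bits: int = 4) -> torch.Tensor:
    """Uniform min-max fake quantization.

    `dim` is the dimension reduced when computing the range, so for a
    (seq_len, head_dim) tensor:
      dim=0 -> one (scale, zero-point) per channel (Key cache)
      dim=1 -> one (scale, zero-point) per token   (Value cache)
    """
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / (2 ** n_bits - 1)
    q = ((x - xmin) / scale).round()
    return q * scale + xmin  # dequantized values, for accuracy simulation

seq_len, head_dim = 16, 64
keys = torch.randn(seq_len, head_dim)
values = torch.randn(seq_len, head_dim)

keys_dq = quantize(keys, dim=0)      # per-channel: Key outliers are channel-aligned
values_dq = quantize(values, dim=1)  # per-token: Value ranges vary by token
```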
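Issue #1 asks how pre-RoPE quantization works during inference. The paper's approach is to quantize Keys before the rotary embedding, where their distributions are better behaved, and re-apply RoPE on the fly when the cache is read, which is possible because token positions are known at attention time. The sketch below reuses the hypothetical `quantize` helper above; `rope` is a generic interleaved rotary implementation, not necessarily the kernel the repo ships.

```python
def rope(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply interleaved rotary position embeddings over the last dim."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos[:, None] * inv_freq[None, :]      # (seq_len, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

pos = torch.arange(seq_len, dtype=torch.float32)
k_cached = quantize(keys, dim=0)   # store *unrotated* Keys in low precision
k_for_attn = rope(k_cached, pos)   # rotate only when the cache is read
```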