mit-han-lab/qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Python, Apache-2.0

Issues

Using QServe with TensorRT-LLM raises an error
#45 opened a month ago by anaivebird
0
QServe with TensorRT-LLM is slower than AWQ INT4 for Llama-2-7B
#46 opened a month ago by anaivebird
0
How to test the accuracy?
#42 opened 2 months ago by lisuying214
1
Would this work on consumer hardware and integrated in frameworks like llama.cpp or others?
#5 opened 7 months ago by Mayorc1978
4
Is an OpenAI-compatible server supported?
#43 opened 2 months ago by anaivebird
0
pip install -e .
#41 opened 2 months ago by lisuying214
0
Some questions about VLM quant
#40 opened 2 months ago by hanhanpp
0
Question about pagedattention
#36 opened 4 months ago by SherrySwift
1
can you support static per-token activation quantization?
#31 opened 3 months ago by geqian-9192
1
Llama-3-8B model dumped by LMQuant in the 4w8a setting raises errors when running the e2e benchmark in QServe.
#29 opened 4 months ago by Patrick-Lew
1
[inf nan] got `inf`, `nan` or element < 0
#38 opened 3 months ago by yunyipower
1
How to add new models?
#33 opened 4 months ago by NicolasDrapier
0
RMSNorm implemented as LayerNorm
#32 opened 4 months ago by jason-huang03
0
[New Feature] Will MLA Be Supported?
#28 opened 5 months ago by RanchiZhao
0
The output of the given model (mit-han-lab/Llama-3-8B-QServe-g128) is wrong
#21 opened 6 months ago by haichuan1221
2
[New Model Supported] MiniCPM-2B
#24 opened 5 months ago by RanchiZhao
0
Question about dequantization overhead
#23 opened 6 months ago by DD-DuDa
3
Request for Benchmark Code for W4A8 GEMM and KV4 Attention
#20 opened 5 months ago by DD-DuDa
0
Questions about FP8 and H100
#19 opened 5 months ago by sijiac
1
How can we reproduce Tables 2 and 3? (PPL and zero-shot accuracy)
#25 opened 5 months ago by kriskrisliu
2
Circular import error
#22 opened 6 months ago by LuckyLYM
1
Is the Table 3 accuracy tested with dequantized weights, or on real accelerated quantized kernels?
#17 opened 7 months ago by vovoluck
1
Expected speed for llama3-70b-instruct
#18 opened 7 months ago by ethxnp
1
support tp
#14 opened 7 months ago by cyLi-Tiger
2
Has anyone tried to HIPify this for AMD/ROCm?
#16 opened 7 months ago by ehartford
0
fast dequantization in per-ch
#15 opened 7 months ago by yanghaihui
0
activation quantization
#13 opened 7 months ago by hanhanpp
1
Llama-2-7B-QServe model doesn't give the expected output
#11 opened 7 months ago by MuYu-zhi
2
Couldn't instantiate the backend tokenizer
#8 opened 7 months ago by Rudin6
1
Any performance comparison with vLLM?
#12 opened 7 months ago by MuYu-zhi
1
Source code
#3 opened 8 months ago by jph00
2
Question about the paper
#10 opened 7 months ago by jameswu2014
3
Accuracy on Qwen1.5-72B
#9 opened 7 months ago by cyLi-Tiger
1
lmquant for QoQ quantization and fake-quantized model dumping
#7 opened 7 months ago by SimpleTheoryOfTypes
1
Is 8bit supported?
#2 opened 8 months ago by nivibilla
3
How to quantize CNN layers (non-MLP layers in general) using QServe?
#1 opened 8 months ago by satabios
1