mit-han-lab/qserve

can you support static per-token activation quantization?

geqian-9192 opened this issue · 1 comment

Can you support static per-token activation quantization? Dynamic quantization is inefficient on hardware.

Hi,

Thanks for your interest in QServe. We fuse the quantization ops into memory-bound ops such as layernorm, silu, etc., so the activation quantization overhead is negligible. Please refer to our paper for more details.
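
To illustrate the idea, here is a minimal pure-Python sketch of dynamic per-token INT8 quantization fused into a layernorm pass. It is not QServe's actual kernel: the function name `layernorm_quant_per_token`, the symmetric 127-level scheme, and the `eps` default are assumptions made for illustration. The point is that each row is already being read for normalization, so the per-token scale can be computed in the same pass without an extra trip to memory.

```python
import math


def layernorm_quant_per_token(x, gamma, beta, eps=1e-5):
    """Hypothetical fused layernorm + dynamic per-token symmetric INT8 quantization.

    x: list of rows (one row per token); gamma, beta: per-channel lists.
    Returns (quantized rows as ints in [-127, 127], one float scale per token).
    """
    q_rows, scales = [], []
    for row in x:
        n = len(row)
        # Layernorm statistics for this token.
        mean = sum(row) / n
        var = sum((v - mean) ** 2 for v in row) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        y = [(v - mean) * inv_std * g + b for v, g, b in zip(row, gamma, beta)]

        # Dynamic per-token scale, computed while the row is still "in hand".
        absmax = max(abs(v) for v in y)
        scale = absmax / 127.0 if absmax > 0 else 1.0

        # Round to nearest and clamp to the symmetric INT8 range.
        q = [max(-127, min(127, round(v / scale))) for v in y]
        q_rows.append(q)
        scales.append(scale)
    return q_rows, scales
```

Dequantizing is `q * scale` per token. In a real GPU kernel the same steps would run inside the normalization kernel, so the quantized output is written directly instead of a full-precision intermediate.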