all-in-one tool for chatglm3-6b: 2nd latency of batch size 1 is larger than batch size 2
Fred-cell opened this issue · 1 comment
Fred-cell commented
lalalapotter commented
This issue is caused by different inference logic for batch size 1 and batch size 2. With batch size 1, quantized KV cache is not enabled automatically; you can enable it manually with `export IPEX_LLM_LOW_MEM=1`. With batch size 2, quantized KV cache is enabled automatically. Note that enabling quantized KV cache speeds up rest-token (2nd-token and later) latency, which is why batch size 2 can show lower 2nd-token latency than batch size 1.
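A minimal sketch of the suggested workaround, assuming the variable is exported in the same shell session that launches the all-in-one benchmark (the benchmark command itself is omitted here; only `IPEX_LLM_LOW_MEM` comes from the comment above):

```shell
# Enable quantized KV cache manually for batch size 1 runs.
export IPEX_LLM_LOW_MEM=1

# Verify the variable is visible to child processes (e.g. the
# Python benchmark script), since it must be set before launch:
python -c "import os; print(os.environ.get('IPEX_LLM_LOW_MEM'))"
```

With the variable set, 2nd-token latency at batch size 1 should follow the same quantized-KV-cache path that batch size 2 takes automatically.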