intel-analytics/ipex-llm

all-in-one tool for chatglm3-6b: 2nd latency of batch size 1 is larger than batch size 2

Fred-cell opened this issue · 1 comment

ipex-llm version: 2.5.0b20240510
[screenshot: all-in-one benchmark latency results]

This issue is caused by different inference logic for batch size 1 and batch size 2. With batch size 1, the quantized kv cache is not enabled automatically; you may need to enable it manually with `export IPEX_LLM_LOW_MEM=1`. With batch size 2, the quantized kv cache is enabled automatically. Note that enabling the quantized kv cache speeds up rest token latency.
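For reference, a minimal sketch of enabling this manually for a batch size 1 run, assuming the standard ipex-llm transformers API and that `THUDM/chatglm3-6b` stands in for your local model path; the environment variable has to be set before the model is loaded:

```python
import os

# Set before loading the model so the low-memory / quantized kv cache path
# is picked up even with batch size 1 (per the comment above).
os.environ["IPEX_LLM_LOW_MEM"] = "1"

from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "THUDM/chatglm3-6b"  # assumed model path for illustration
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```

With this in place, the batch size 1 run should use the same quantized kv cache path as batch size 2, so the rest token (2nd+) latencies become comparable.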