intel-analytics/ipex-llm

2nd latency of llama3-8B-Instruct with int4 & all-in-one tool issue

Fred-cell opened this issue · 1 comment

The 2nd-token latency of llama3-8b-instruct with int4 and bs=1 is larger than with bs=2 (ipex-llm=2.5.0b20240504).

We have already reproduced the issue and will fix it later. In the meantime, we recommend using fp16 for the non-linear layers: please refer to the all-in-one benchmark scripts and select the transformer_int4_fp16_gpu API.
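As a sketch of the suggested workaround, the all-in-one benchmark is driven by a `config.yaml` where `test_api` selects the quantization scheme; the field names below follow the benchmark's config format, while the model path and in/out pairs are illustrative assumptions:

```yaml
# Illustrative config.yaml for the all-in-one benchmark (values are examples)
repo_id:
  - 'meta-llama/Meta-Llama-3-8B-Instruct'
local_model_hub: 'path/to/local/model/hub'   # assumption: your local model directory
warm_up: 1
num_trials: 3
num_beams: 1
batch_size: 1
in_out_pairs:
  - '1024-128'
test_api:
  - 'transformer_int4_fp16_gpu'   # int4 weights, fp16 for non-linear layers
```

With `transformer_int4_fp16_gpu` selected instead of `transformer_int4_gpu`, the linear layers stay int4-quantized while the remaining computation runs in fp16, which is the mitigation suggested above for the bs=1 latency regression.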