2nd token latency of llama3-8B-Instruct with int4 & all-in-one tool issue
Fred-cell opened this issue · 1 comment
Fred-cell commented
lalalapotter commented
We have already reproduced the issue and will fix it later. In the meantime, we recommend using fp16 for the non-linear layers: please refer to the all-in-one benchmark scripts and select the transformer_int4_fp16_gpu API.
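For reference, a minimal sketch (not the benchmark script itself) of what the transformer_int4_fp16_gpu path amounts to: int4 weights for the linear layers plus fp16 for the rest of the model on an Intel GPU. The model path, prompt, and generation parameters below are placeholders, and the import assumes a recent ipex-llm release (older releases use `bigdl.llm.transformers`).

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder path

# Quantize the linear layers to int4; non-quantized layers stay in full precision here.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    optimize_model=True,
    use_cache=True,
)
# Cast the remaining (non-linear) layers to fp16 and move the model to the Intel GPU,
# which is the combination the transformer_int4_fp16_gpu benchmark API exercises.
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")

with torch.inference_mode():
    output = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

When using the all-in-one benchmark instead, the same effect is obtained by listing transformer_int4_fp16_gpu under the test APIs in its config file.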