2nd token latency of llama3-8B-Instruct with int4 & all-in-one tool issue
Fred-cell opened this issue · 1 comment
Fred-cell commented
lalalapotter commented
We have already reproduced the issue and will fix it later. In the meantime, we recommend using fp16 for the non-linear layers: please refer to the all-in-one benchmark scripts and select the transformer_int4_fp16_gpu API.
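For reference, a minimal sketch (not the benchmark script itself) of what the transformer_int4_fp16_gpu path amounts to: int4 weights for the linear layers plus fp16 for the rest of the model on an Intel GPU. The model path, prompt, and generation parameters below are placeholders, and the import assumes a recent ipex-llm release (older releases use `bigdl.llm.transformers`).

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder path

# Quantize the linear layers to int4; non-quantized layers stay in full precision here.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    optimize_model=True,
    use_cache=True,
)
# Cast the remaining (non-linear) layers to fp16 and move the model to the Intel GPU,
# which is the combination the transformer_int4_fp16_gpu benchmark API exercises.
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")

with torch.inference_mode():
    output = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

When using the all-in-one benchmark instead, the same effect is obtained by listing transformer_int4_fp16_gpu under the test APIs in its config file.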