mit-han-lab/qserve

Llama-3-8B model dumped by LMQuant with the W4A8 (QoQ) setting raises errors when running the e2e benchmark in QServe.

Patrick-Lew opened this issue · 1 comment

I dumped the quantized Llama-3-8B model from LMQuant using QoQ, with the following command from lmquant/projects/llm/scripts/qoq.sh:
```sh
# QoQ (W4A8KV4 with per-channel weight quantization) on Llama3-8B
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml --model-name llama3-8b --smooth-xw-alpha 0.05 --smooth-xw-beta 0.95 --smooth-yx-strategy GridSearch --smooth-yx-beta " -2"
```
and I append --save-model and --model-path to that command.
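For reference, the full invocation looked roughly like the sketch below; the --model-path value is a hypothetical placeholder, not the actual output directory I used.

```sh
# QoQ (W4A8KV4) quantization of Llama-3-8B with model dumping enabled.
# NOTE: ./llama3-8b-qoq-w4a8 is a placeholder output path.
python -m lmquant.llm.run configs/llm.yaml configs/qoq/gchn.yaml \
    --model-name llama3-8b \
    --smooth-xw-alpha 0.05 --smooth-xw-beta 0.95 \
    --smooth-yx-strategy GridSearch --smooth-yx-beta " -2" \
    --save-model --model-path ./llama3-8b-qoq-w4a8
```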
Then I run the checkpoint_convert.py script in QServe to get the converted checkpoint:
[Screenshot 2024-08-12 at 12:07:22 PM: checkpoint conversion command]
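In case the screenshot is not accessible, a conversion command of this kind typically looks something like the sketch below; the flag names and paths are my best guess rather than the exact command in the screenshot, so please check the QServe README for the precise interface.

```sh
# Hypothetical checkpoint conversion step (flag names and paths are assumptions;
# see the QServe README for the actual interface).
python checkpoint_convert.py \
    --model-path ./llama3-8b-qoq-w4a8 \
    --quant-path ./llama3-8b-qoq-w4a8-qserve \
    --group-size -1
```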
Then I run the e2e benchmark with this command:
[Screenshot 2024-08-12 at 12:07:56 PM: e2e benchmark command]
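Again, in case the screenshot is not accessible, an e2e run is typically launched along the lines of the sketch below; the script name and flags are assumptions based on the QServe examples, not the exact command shown in the screenshot.

```sh
# Hypothetical e2e generation run on the converted W4A8KV4 checkpoint
# (script name and flags are assumptions; consult the QServe README).
python qserve_e2e_generation.py \
    --model ./llama3-8b-qoq-w4a8-qserve \
    --quant-path ./llama3-8b-qoq-w4a8-qserve \
    --precision w4a8kv4 \
    --group-size -1 \
    --ifb-mode
```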
but the results look like this:
[Screenshot 2024-08-12 at 12:08:26 PM: benchmark output/errors]

I want to know whether this e2e benchmark can only be run on the Llama-3-8B-Instruct model that you kindly provided in your Hugging Face repo.
I also tried running the e2e benchmark on other models such as Llama-3-8B (non-instruct), but it raises the same errors as above.

Thanks.
Patrick

Hi,

Thanks for your interest in QServe! We suggest using instruction-tuned models for e2e generation to get robust outputs. The current conversation template is designed for instruct models.
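To illustrate why this matters: the e2e script wraps each user prompt in chat control tokens, and only instruction-tuned checkpoints were trained to follow them, so a base model typically emits unstructured continuations instead of an answer. The sketch below shows the publicly documented Llama-3-Instruct prompt format; the exact template used inside QServe may differ in details.

```sh
# Sketch of a Llama-3-Instruct style prompt wrapper (based on the public Llama-3
# chat format; QServe's internal conversation template may differ in details).
# A base, non-instruct model was never trained on these control tokens, so it
# usually produces unstructured text instead of a clean answer.
PROMPT='<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
printf '%s' "$PROMPT"
```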