Repetition with Llama3-70b and EETQ
mjsteele12 opened this issue · 2 comments
First of all, thank you for EETQ!
I am using EETQ with TGI to serve Llama3-70b-instruct. I have noticed that, compared to other quants (bnb-nf4, AWQ), the repetition I get from Llama3 is significantly higher (all other generation parameters being the same). I am doing a lot of structured extraction with TGI's grammar feature, e.g. topic classification/extraction. With EETQ, the responses I get may look like this:
["Math",
"Science",
"Reading",
"Math",
"Science",
"Reading",
"Math",
"Science",
"Reading",
"Math",
"Science",
"Reading"]
With other quants, by contrast, I get the expected output:
["Math",
"Science",
"Reading"]
For all quants I'm using a repetition_penalty of 1.1, temperature of 0.1, and top_p of 0.95, but as stated I'm only observing this with EETQ. I have absolutely no idea how to debug this, or whether that's even possible, but the repetition issue holds across prompts and inputs, so I wanted to share. I'm using the TGI Docker container and the official Llama3-70b-instruct weights (for the bnb-nf4 run).
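For concreteness, a request of roughly this shape exercises the setup described above. This is a minimal sketch assuming a TGI server on localhost:8080 with grammar support enabled (TGI >= 1.4.3); the prompt and JSON schema are illustrative placeholders, not the exact ones I use:

```python
import requests

payload = {
    "inputs": "List the school subjects mentioned in the passage below as a JSON array.\n\n<passage here>",
    "parameters": {
        "max_new_tokens": 128,
        "repetition_penalty": 1.1,
        "temperature": 0.1,
        "top_p": 0.95,
        # TGI's grammar-constrained generation: force a JSON array of strings.
        "grammar": {
            "type": "json",
            "value": {"type": "array", "items": {"type": "string"}},
        },
    },
}

resp = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
print(resp.json()["generated_text"])
```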
I'm wondering if anyone else has come across this or has any insights.
The problem may be caused by precision degradation from quantization. To verify this, please use logger.info in TGI to print the logits produced with EETQ and compare them with those of the original (unquantized) model. My guess is that the logits of ']' and ',' are close to begin with, so that after EETQ quantization the logit of ',' ends up larger than that of ']'.
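Outside of TGI, the same comparison can be sketched directly with transformers (assuming a version with EETQ support, roughly 4.38+, and the eetq package installed). The checkpoint, prompt, and sequential loading below are illustrative; a smaller model makes the check far cheaper than the full 70B:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, EetqConfig

# Placeholder checkpoint: a smaller Llama variant is more practical than the
# full 70B if you just want to see whether the ']' vs ',' gap collapses.
model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)

# Stop the prompt right where the model has to choose between ',' and ']'.
prompt = '["Math",\n"Science",\n"Reading"'
inputs = tok(prompt, return_tensors="pt")

def last_token_logits(model):
    with torch.no_grad():
        out = model(**inputs.to(model.device))
    return out.logits[0, -1]

# Load, score, and free each variant in turn so both fit in memory.
fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
ref = last_token_logits(fp16)
del fp16
torch.cuda.empty_cache()

quant = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=EetqConfig("int8"), device_map="auto"
)
qnt = last_token_logits(quant)

# If the fp16 gap between ']' and ',' is small, EETQ's rounding error can
# plausibly flip their order, producing the repeated-list behavior above.
for s in ["]", ","]:
    tid = tok.encode(s, add_special_tokens=False)[0]
    print(f"{s!r}: fp16={ref[tid].item():+.3f}  eetq={qnt[tid].item():+.3f}")
```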
This issue was closed because it has been stalled for 30 days with no activity. Please feel free to reopen it if needed.