Performance discrepancy of Llama3.1-8b-instruct
zhenyuhe00 opened this issue · 9 comments
Hi,
Thank you for your valuable benchmark. I used your code to test the performance of Llama3.1-8b-instruct at the 16K context length but got much better results than the numbers you reported. The config I use is:
llama3.1-8b-instruct)
MODEL_PATH="${MODEL_DIR}/Meta-Llama-3.1-8B-Instruct"
MODEL_TEMPLATE_TYPE="meta-chat"
MODEL_FRAMEWORK="hf"
;;
and the results are:
Tasks,niah_single_1,niah_single_2,niah_single_3,niah_multikey_1,niah_multikey_2,niah_multikey_3,niah_multivalue,niah_multiquery,vt,cwe,fwe,qa_1,qa_2
Score,100.0,100.0,99.8,99.4,100.0,99.0,97.9,99.75,99.96,91.3,100.0,89.6,99.8
Nulls,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500
These scores have a mean of 98.19, whereas you reported 91.6. I wonder what the reason for the discrepancy is.
Part of the reason for this performance difference is covered by my PR here: the Paul Graham Essays dataset is not assembled the same way each time it is downloaded, which can lead to observable numerical discrepancies in the results.
fix paul grahams essay
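For context, a minimal sketch of the kind of determinism fix, assuming the data-prep script globs downloaded essay files (the path and variable names are illustrative, not the repo's actual code):

```python
import glob

# Sort the downloaded essay files so the haystack text is assembled in the
# same order on every machine, instead of relying on filesystem/glob order.
essay_files = sorted(glob.glob("data/essays/*.txt"))

haystack = []
for path in essay_files:
    with open(path, encoding="utf-8") as f:
        haystack.append(f.read())
haystack_text = "\n\n".join(haystack)
```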
@zhenyuhe00 I guess another reason is that you should change the template for Llama 3.1 models to something like
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{task_template}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
The default meta-chat template is used for the Llama 2 series.
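For example, a template entry along these lines could be registered (the dict name `TEMPLATES` and the key `llama-3` are assumptions for illustration; only the template string itself comes from the comment above):

```python
# Hypothetical registration of a Llama 3.x chat template.
TEMPLATES = {}  # stand-in for the repo's template registry

TEMPLATES["llama-3"] = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "{task_template}"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
```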
Also, it is interesting that you can get nearly perfect scores on the QA tasks. From my side, I usually get qa_1: 8x and qa_2: 6x-7x, even for closed-source Gemini or GPT-4 models.
LOL I also can't get nearly perfect scores for QA tasks even on 16k context.
@Wangmerlyn @hsiehjackson does the current order of essays match what has been used to populate the tables in the repo and the paper?
The nearly perfect QA scores are caused by a problem with the Llama 3.1 tokenizer. The text changes after applying the tokenizer's encode and decode functions (e.g., from ". ..." to "...."), so the prompt of the question-answering task ends up included in the prediction because of this line:
if text.startswith(prompt):
It would be better to comment out this line and instead truncate the generated tokens to exclude the prompt.
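A minimal sketch of both the reported round-trip mismatch and a token-length-based truncation that avoids the string comparison (model name and variable names are illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

prompt = "Answer the question based on the context. ..."
prompt_ids = tok(prompt, add_special_tokens=False)["input_ids"]

# As reported above, decode(encode(prompt)) may not reproduce the prompt
# exactly (e.g. ". ..." coming back as "...."), so a check like
# `text.startswith(prompt)` can silently fail and leave the prompt inside
# the prediction.
roundtrip = tok.decode(prompt_ids)
print(roundtrip == prompt)

# Truncating by token count instead of string matching sidesteps the issue:
# inputs = tok(prompt, return_tensors="pt").to(model.device)
# output_ids = model.generate(**inputs, max_new_tokens=32)[0]
# pred = tok.decode(output_ids[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```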
@zhenyuhe00 I managed to reproduce the results of 3.1 (with the right model template) and I think I haven't been affected by this issue with the tokenizer? Could you please share more specific steps to reproduce the behaviour you're talking about?
I don't think the tokenizer glitch alone would cause the reproducibility issue, but there could be some unexpected underlying behaviour. If this check ("if prompt in pred") is not triggered, the input prompt is not truncated from the prediction. Since the metrics are match-part and match-all, that leads to a full score even when the model gives the wrong answer.
@hsiehjackson Can you please take a quick look into this?
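To make the concern concrete, a small sketch of a containment-style metric like the one described (the function is illustrative, not the repo's actual scoring code):

```python
def string_match_part(pred: str, refs: list[str]) -> float:
    # Illustrative "match part" metric: full credit if any reference string
    # appears anywhere in the prediction.
    return float(any(r.lower() in pred.lower() for r in refs))

refs = ["Paris"]

# If the prompt (which contains the gold passage mentioning "Paris") is not
# stripped from the model output, the metric matches against the prompt text
# rather than the model's actual answer:
pred_with_prompt = "... Paris is the capital of France ... Answer: London"
print(string_match_part(pred_with_prompt, refs))  # 1.0 despite the wrong answer
```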
You are probably not using the Pipeline from transformers. The Llama 3.1 tokenizer does have the problem that a text segment may not remain the same after encode followed by decode.
However, if you are using the Pipeline from transformers (if you have flash-attn installed, the Pipeline should be used), the prompt is removed from the generated tokens (tested on my machine). I guess the reason is that the text-generation pipeline does not simply encode and decode the prompt, which is why using the Llama 3.1 tokenizer through the Pipeline does not trigger the glitch where the prompt cannot be removed.
I have opened a new PR addressing this problem for the case when the Pipeline is not used.
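For illustration, a sketch of two robust ways to drop the prompt from the output, assuming standard transformers APIs (this is not the repo's actual code; `return_full_text` is a standard text-generation pipeline argument):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # illustrative
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "..."

# Pipeline path: let the pipeline strip the prompt itself, so no string
# comparison against a re-decoded prompt is ever needed.
pipe = pipeline("text-generation", model=model, tokenizer=tok)
pred = pipe(prompt, max_new_tokens=32, return_full_text=False)[0]["generated_text"]

# Manual path: slice the generated ids by the input length instead of
# comparing decoded strings.
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
pred = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```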