Performance discrepancy of Llama3.1-8b-instruct
zhenyuhe00 opened this issue · 9 comments
Hi,
Thank you for your valuable benchmark. I used your code to test the performance of Llama3.1-8b-instruct at the 16K context length but got much better results than the numbers you reported. The config I use is:
llama3.1-8b-instruct)
MODEL_PATH="${MODEL_DIR}/Meta-Llama-3.1-8B-Instruct"
MODEL_TEMPLATE_TYPE="meta-chat"
MODEL_FRAMEWORK="hf"
;;
and the results are:
Tasks,niah_single_1,niah_single_2,niah_single_3,niah_multikey_1,niah_multikey_2,niah_multikey_3,niah_multivalue,niah_multiquery,vt,cwe,fwe,qa_1,qa_2
Score,100.0,100.0,99.8,99.4,100.0,99.0,97.9,99.75,99.96,91.3,100.0,89.6,99.8
Nulls,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500,0/500
These scores have a mean of 98.19, whereas you reported 91.6. I wonder what the reason for the discrepancy is.
Part of the reason for this performance difference is covered by my PR here: the Paul Graham Essays dataset is not assembled the same way each time it is downloaded, which can lead to observable numerical discrepancies in the results.
fix paul grahams essay
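For context, a minimal sketch of the kind of determinism fix, assuming the data-prep script globs downloaded essay files (the path and variable names are illustrative, not the repo's actual code):

```python
import glob

# Sort the downloaded essay files so the haystack text is assembled in the
# same order on every machine, instead of relying on filesystem/glob order.
essay_files = sorted(glob.glob("data/essays/*.txt"))

haystack = []
for path in essay_files:
    with open(path, encoding="utf-8") as f:
        haystack.append(f.read())
haystack_text = "\n\n".join(haystack)
```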
@zhenyuhe00 I guess another reason is that you should change the template for Llama 3.1 models to something like
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{task_template}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
The default meta-chat template is used for the Llama 2 series.
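For example, a template entry along these lines could be registered (the dict name `TEMPLATES` and the key `llama-3` are assumptions for illustration; only the template string itself comes from the comment above):

```python
# Hypothetical registration of a Llama 3.x chat template.
TEMPLATES = {}  # stand-in for the repo's template registry

TEMPLATES["llama-3"] = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "{task_template}"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
```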
Also, it is interesting that you can get nearly perfect scores on the QA tasks. From my side, I usually get qa_1: 8x and qa_2: 6x-7x, even for closed-source Gemini or GPT-4 models.
LOL I also can't get nearly perfect scores for QA tasks even on 16k context.
@Wangmerlyn @hsiehjackson does the current order of essays match what has been used to populate the tables in the repo and the paper?
The nearly perfect QA scores are caused by a problem with the Llama 3.1 tokenizer. The text changes after applying the tokenizer's encode and decode functions (e.g., from ". ..." to "...."), so the prompt of the question-answering task ends up included in the prediction because of this line:
if text.startswith(prompt):
It would be better to comment out this line and instead truncate the generated tokens to exclude the prompt.
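A minimal sketch of both the reported round-trip mismatch and a token-length-based truncation that avoids the string comparison (model name and variable names are illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

prompt = "Answer the question based on the context. ..."
prompt_ids = tok(prompt, add_special_tokens=False)["input_ids"]

# As reported above, decode(encode(prompt)) may not reproduce the prompt
# exactly (e.g. ". ..." coming back as "...."), so a check like
# `text.startswith(prompt)` can silently fail and leave the prompt inside
# the prediction.
roundtrip = tok.decode(prompt_ids)
print(roundtrip == prompt)

# Truncating by token count instead of string matching sidesteps the issue:
# inputs = tok(prompt, return_tensors="pt").to(model.device)
# output_ids = model.generate(**inputs, max_new_tokens=32)[0]
# pred = tok.decode(output_ids[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```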
@zhenyuhe00 I managed to reproduce the results of 3.1 (with the right model template) and I think I haven't been affected by this issue with the tokenizer? Could you please share more specific steps to reproduce the behaviour you're talking about?
I don't think the tokenizer glitch alone would cause the reproducibility issue, but there could be some unexpected underlying behaviour. If this check ("if prompt in pred") is not triggered, the input prompt is not truncated from the prediction. Since the metrics are match-part and match-all, that leads to a full score even when the model gives the wrong answer.
@hsiehjackson Can you please take a quick look into this?
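To make the concern concrete, a small sketch of a containment-style metric like the one described (the function is illustrative, not the repo's actual scoring code):

```python
def string_match_part(pred: str, refs: list[str]) -> float:
    # Illustrative "match part" metric: full credit if any reference string
    # appears anywhere in the prediction.
    return float(any(r.lower() in pred.lower() for r in refs))

refs = ["Paris"]

# If the prompt (which contains the gold passage mentioning "Paris") is not
# stripped from the model output, the metric matches against the prompt text
# rather than the model's actual answer:
pred_with_prompt = "... Paris is the capital of France ... Answer: London"
print(string_match_part(pred_with_prompt, refs))  # 1.0 despite the wrong answer
```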
You are probably not using the Pipeline from transformers. The Llama 3.1 tokenizer does have the problem that a text segment may not remain the same after encode followed by decode.
However, if you are using the Pipeline from transformers (if you have flash-attn installed, the Pipeline should be used), the prompt is removed from the generated tokens (tested on my machine). I guess the reason is that the text-generation pipeline does not simply encode and decode the prompt, which is why using the Llama 3.1 tokenizer through the Pipeline does not trigger the glitch where the prompt cannot be removed.
I have opened a new PR addressing this problem for the case when the Pipeline is not used.
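For illustration, a sketch of two robust ways to drop the prompt from the output, assuming standard transformers APIs (this is not the repo's actual code; `return_full_text` is a standard text-generation pipeline argument):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # illustrative
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "..."

# Pipeline path: let the pipeline strip the prompt itself, so no string
# comparison against a re-decoded prompt is ever needed.
pipe = pipeline("text-generation", model=model, tokenizer=tok)
pred = pipe(prompt, max_new_tokens=32, return_full_text=False)[0]["generated_text"]

# Manual path: slice the generated ids by the input length instead of
# comparing decoded strings.
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
pred = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```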