hsiehjackson/RULER

Can't reproduce results of meta-llama/Meta-Llama-3.1-8B-Instruct

Closed this issue · 4 comments

Hey,

Thanks for your amazing work - it's really impressive.

I wanted to reproduce the results of meta-llama/Meta-Llama-3.1-8B-Instruct, reported in Table 10 in your paper. I believe it should be perfect accuracy for PasskeyRetrieval at 128k, but I get 62.5.

I did the following:

> Setup

docker run --gpus all -it --rm cphsieh/ruler:0.1.0
cd /../RULER

# Below are needed due to various errors
pip install transformers==4.43.1
pip install huggingface-hub==0.23.2
pip install vllm==0.5.4

> Downloaded meta-llama/Meta-Llama-3.1-8B-Instruct
> Changed the number of input samples to 40
> Left only "niah_multikey_3" task

> Defined the model as

        llama3.1)
            MODEL_PATH="${MODEL_DIR}/Meta-Llama-3.1-8B-Instruct"
            MODEL_TEMPLATE_TYPE="llama-3.1"
            MODEL_FRAMEWORK="vllm"

> Defined the following template

    "llama-3.1": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{task_template}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
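For reference, here is a minimal sketch of how a chat template like the one above expands into a full prompt. This assumes the template is filled with Python's `str.format`, as the `{task_template}` placeholder suggests; the variable names and the sample task prompt are illustrative, not RULER's actual code:

```python
# The template string copied from the config above; {task_template} is the
# placeholder that gets substituted with the generated task prompt.
LLAMA31_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "{task_template}"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Hypothetical task prompt; the real one is generated by RULER's data scripts.
task_prompt = "A special magic number is hidden in the following text. ..."

full_prompt = LLAMA31_TEMPLATE.format(task_template=task_prompt)
print(full_prompt)
```

The prompt ends with the assistant header so that generation starts directly with the model's answer.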

And I got

       0                1
0  Tasks  niah_multikey_3
1  Score             62.5
2  Nulls             0/40

Could you please tell me what I should do here to get the reported perfect accuracy?

Many thanks!

Hi @PiotrNawrot, if you want to test the PasskeyRetrieval task proposed in the landmark attention paper, the task name should be niah_single_1. In your setting, niah_multikey_3 is a much harder task, since we use UUIDs as the needle key-value pairs and there are multiple distractor needles in the context.
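To illustrate the difference, here is a hedged sketch (the exact needle wording and counts in RULER may differ): a single-needle passkey task hides one easy-to-spot number, while a multikey UUID task hides the target among several distractors with the same surface pattern:

```python
import uuid

def single_needle():
    # niah_single_1-style (illustrative): one simple passkey needle.
    return "One of the special magic numbers is: 7483."

def multikey_needles(n_distractors=3):
    # niah_multikey_3-style (illustrative): UUID keys *and* values, plus
    # distractor needles that look identical except for the UUIDs, so the
    # model must match the exact queried key.
    needles = []
    for _ in range(n_distractors + 1):
        key, value = uuid.uuid4(), uuid.uuid4()
        needles.append(f"One of the special magic uuids for {key} is: {value}.")
    return needles

print(single_needle())
for line in multikey_needles():
    print(line)
```

With UUIDs there is no memorable pattern to latch onto, which is why accuracy drops sharply at long context lengths even for models that ace the simple passkey task.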

Oh right, sorry - I think I got confused by the naming of all these tasks and assumed that passkey_retrieval = kv_retrieval.

Could you please confirm that you used the same model template for testing llama 3.1?

Yes, your template is the same as the one I use to evaluate the Llama 3.1 series.

Thanks!