Can't reproduce results of meta-llama/Meta-Llama-3.1-8B-Instruct
Closed this issue · 4 comments
Hey,
Thanks for your amazing work - it's really impressive.
I wanted to reproduce the results of meta-llama/Meta-Llama-3.1-8B-Instruct, reported in Table 10 in your paper. I believe it should be perfect accuracy for PasskeyRetrieval at 128k, but I get 62.5.
I did the following:
> Setup
docker run --gpus all -it --rm cphsieh/ruler:0.1.0
cd /../RULER
# Below are needed due to various errors
pip install transformers==4.43.1
pip install huggingface-hub==0.23.2
pip install vllm==0.5.4
> Downloaded meta-llama/Meta-Llama-3.1-8B-Instruct
> Changed the number of input samples to 40
> Left only "niah_multikey_3" task
> Defined the model as
llama3.1)
    MODEL_PATH="${MODEL_DIR}/Meta-Llama-3.1-8B-Instruct"
    MODEL_TEMPLATE_TYPE="llama-3.1"
    MODEL_FRAMEWORK="vllm"
    ;;
> Defined the following template
"llama-3.1": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{task_template}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
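For reference, the template above is just a Python format string that wraps the task text in Llama 3.1's chat markup; a minimal sketch of how RULER-style code would fill it in (the example task text here is made up):

```python
# The "llama-3.1" template from the config above, verbatim.
LLAMA31_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "{task_template}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

def build_prompt(task_text: str) -> str:
    """Insert the task text into the chat template's {task_template} slot."""
    return LLAMA31_TEMPLATE.format(task_template=task_text)

# Hypothetical task text, just to show the resulting prompt shape.
print(build_prompt("What is the passkey?"))
```

The model then generates the assistant turn immediately after the trailing `assistant` header, which is why the template ends with `<|end_header_id|>\n\n` and no closing token.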
And I got
Tasks  niah_multikey_3
Score  62.5
Nulls  0/40
Could you please tell me what I should do here to get the reported perfect accuracy?
Many thanks!
Hi @PiotrNawrot, if you want to test PasskeyRetrieval as proposed by landmark attention, I think the task name should be niah_single_1. In your setting, niah_multikey_3 is a very hard task, since we use UUIDs as the needle keys and values and there are multiple distracting needles in the context.
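To make the distinction concrete, here is a toy sketch of what a multikey UUID item looks like (this is an illustration only, not RULER's actual data generator; the phrasing and helper names are made up):

```python
import uuid

def make_needles(n: int) -> dict[str, str]:
    """Build n key-value needles where both key and value are random UUIDs.

    Only one needle is queried at evaluation time; the remaining n - 1
    act as distractors scattered through the long context.
    """
    return {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n)}

needles = make_needles(4)               # one target + three distractors
target_key = next(iter(needles))
print(f"Retrieve the value for {target_key}: {needles[target_key]}")
```

Matching an arbitrary 36-character UUID among several look-alike distractors is far harder than spotting the single natural-language passkey that niah_single_1 uses, which explains the 62.5 score.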
Oh right, sorry, I think I got confused by the naming of all these tasks and was convinced that passkey_retrieval = kv_retrieval.
Could you please confirm that you used the same model template for testing Llama 3.1?
Yes, your template is the same as the one I use to evaluate the Llama 3.1 series.
Thanks!