hsiehjackson/RULER

Llama 3 rope theta

ganler opened this issue · 4 comments

Thanks for the great work!

From the README:

The results are evaluated by changing rope_theta to 16M in here.

Could I ask the reason for adjusting rope_theta here rather than directly using, say, dynamic RoPE scaling? Thanks!

Hi @ganler, we simply changed rope_theta to 16M following this post. It would be interesting to use dynamic RoPE scaling without training for the Llama 3 models. We'll consider adding those results later.
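For context, a minimal sketch of overriding rope_theta through Hugging Face transformers (the checkpoint name is an assumption here, not necessarily the one RULER evaluates):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint for illustration
config = AutoConfig.from_pretrained(model_id)
config.rope_theta = 16_000_000.0  # raise the RoPE base (Llama 3 ships with 500k)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```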

Thank you! We tried dynamic RoPE scaling and it significantly improved the Llama 3 models (https://evalplus.github.io/repoqa.html).

Do you have any hints on why using a 16M rope theta also works much better? Thanks!

I'm not sure which dynamic RoPE scaling technique you are referring to. In Hugging Face, we have dynamic NTK scaling here, which dynamically increases the RoPE base based on the input sequence length and is similar to directly changing rope theta to a large value (see the sketch after these links). As for why a large base is useful, there are plenty of papers investigating tricks for modifying RoPE:
https://arxiv.org/pdf/2309.16039
https://arxiv.org/pdf/2310.05209
https://arxiv.org/pdf/2309.00071
https://arxiv.org/pdf/2402.13753
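A minimal sketch of that dynamic NTK rescaling rule, based on my reading of the Hugging Face implementation (parameter values below are illustrative, not RULER's settings):

```python
def dynamic_ntk_base(base, seq_len, max_position_embeddings, dim, scaling_factor=1.0):
    """Grow the RoPE base once the sequence exceeds the trained context window."""
    if seq_len <= max_position_embeddings:
        return base  # unchanged inside the trained window
    return base * (
        (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    ) ** (dim / (dim - 2))

# Illustrative numbers: Llama 3 base 500k, head dim 128, 8k trained context, 64k prompt
print(dynamic_ntk_base(base=500_000.0, seq_len=65_536, max_position_embeddings=8_192, dim=128))
```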

Dynamic RoPE scaling: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/

Basically, it dynamically adjusts the scale factor to context_len/model_len when context_len > model_len (see the sketch below). It seems to be the same thing your code is showing.
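A rough sketch of what I mean by that dynamic scaling, with illustrative names (this is not RULER's or RepoQA's actual code):

```python
def dynamic_linear_scale(position_ids, context_len, trained_len):
    """Compress position indices back into the trained range when the prompt is longer."""
    scale = max(1.0, context_len / trained_len)
    return [pos / scale for pos in position_ids]

# e.g. a 16k-token prompt on an 8k-trained model halves every position index
print(dynamic_linear_scale([0, 4_096, 8_192, 16_383], 16_384, 8_192))
```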

dynamically increases the RoPE base based on the input sequence length and is similar to directly changing rope theta to a large value

I don't quite see the similarity, but thanks for the references; I will check them. Thank you!