Llama 3 rope theta
ganler opened this issue · 4 comments
Thanks for the great work!
From the README:
The results are evaluated by changing rope_theta to 16M in here.
Can I ask the reason for adjusting rope_theta here rather than directly using, say, dynamic RoPE scaling? Thanks!
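For reference, this is roughly how I would apply the rope_theta override through the Hugging Face config (the model id and the 16M value below are just placeholders for illustration):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Placeholder model id; 16_000_000 is the 16M value mentioned in the README.
model_id = "meta-llama/Meta-Llama-3-8B"

config = AutoConfig.from_pretrained(model_id)
config.rope_theta = 16_000_000  # Llama 3 checkpoints ship with 500_000 by default

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```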
Thank you! We tried dynamic RoPE scaling and it significantly improved the Llama 3 models (https://evalplus.github.io/repoqa.html).
Do you have any hints on why using a 16M rope theta also works much better? Thanks!
Not sure what dynamic RoPE scaling techniques you are referring to. In Hugging Face, we have dynamic NTK scaling here to dynamically increase the RoPE base based on the input sequence length, which is similar to directly changing rope_theta to a large value (a rough sketch of that adjustment follows the links below). As for why using a large base is useful, there are plenty of papers investigating tricks for modifying RoPE:
https://arxiv.org/pdf/2309.16039
https://arxiv.org/pdf/2310.05209
https://arxiv.org/pdf/2309.00071
https://arxiv.org/pdf/2402.13753
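For concreteness, here is a minimal sketch of the base adjustment that dynamic NTK scaling performs, approximating the formula in transformers' dynamic RoPE implementation (the function name and example numbers are illustrative, not the exact library code):

```python
def dynamic_ntk_base(base: float, seq_len: int, max_position_embeddings: int,
                     dim: int, scaling_factor: float = 1.0) -> float:
    """Sketch: grow the RoPE base once the input exceeds the trained context,
    instead of hard-coding a single large rope_theta."""
    if seq_len <= max_position_embeddings:
        return base  # within the training window, keep the original base
    # The base grows with the ratio seq_len / max_position_embeddings.
    return base * (
        (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    ) ** (dim / (dim - 2))

# Illustrative numbers: base 500k, head dim 128, 8k training context, 32k prompt.
print(dynamic_ntk_base(500_000, 32_768, 8_192, 128))  # ~2M, i.e. a much larger effective theta
```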
Dynamic RoPE scaling: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/
Basically, it dynamically adjusts the scale factor to context_len / model_len when context_len > model_len. It seems to be the same thing your code is doing.
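In other words, something like the following toy sketch (the names are just illustrative of how I read that description):

```python
def dynamic_scale_factor(context_len: int, model_len: int) -> float:
    """Toy sketch: only scale positions once the prompt is longer than the
    model's original context window."""
    if context_len <= model_len:
        return 1.0  # no scaling needed within the trained context
    return context_len / model_len

print(dynamic_scale_factor(32_768, 8_192))  # -> 4.0
```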
Regarding "dynamically increase the RoPE base based on the input sequence length, which is similar to directly changing rope_theta to a large value": I don't quite see the similarity yet, but thanks for the references; I will check them. Thank you!