[LLaMA] Rotary positional embedding differs from the official implementation
lytning98 opened this issue · 9 comments
transformers implements the LLaMA model's Rotary Positional Embedding (RoPE) as follows:
transformers/src/transformers/models/llama/modeling_llama.py, lines 173 to 188 in e42587f
This is GPT-NeoX style RoPE. But in Meta's official model implementation, the model adopts GPT-J style RoPE, which rotates the query and key vectors in interleaved pairs instead of splitting them into two halves (as the rotate_half method does).
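For reference, the half-split (GPT-NeoX style) rotation in modeling_llama.py boils down to the following simplified sketch; the cited lines also handle gathering the cos/sin caches by position id and broadcasting over heads, which is omitted here:

```python
import torch

def rotate_half(x):
    # Split the last dimension into two halves and swap them with a sign flip:
    # (x1, x2) -> (-x2, x1).
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # Rotate the pair (x_i, x_{i + d/2}) by the angle for frequency i;
    # cos and sin are precomputed per position.
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```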
Meta's official repo implements RoPE as follows (full code link):
```python
from typing import Tuple
import torch

def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # View the last dimension as interleaved (real, imag) pairs and rotate them
    # by complex multiplication with the precomputed frequencies.
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # reshape_for_broadcast is a helper defined earlier in the linked file.
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
```
I'm confused by this difference: since transformers.LlamaModel can directly load weights converted from the officially released checkpoint, won't this lead to inconsistent inference results? Is this difference expected?
same confusion
same confusion
@santiweide The parameters of some layers are re-permuted while converting the weights in the official conversion script. Check:
transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py, lines 113 to 115 in e42587f
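For reference, the permutation around those lines looks roughly like this (a paraphrased sketch, not an exact copy of the cited code; the explicit n_heads/dim arguments are added here for self-containedness). It is applied to the q_proj and k_proj weights so that the half-split rotation acts on the same logical pairs as Meta's interleaved one:

```python
import torch

def permute(w: torch.Tensor, n_heads: int, dim: int) -> torch.Tensor:
    # Reorder each head's rows from interleaved order (0, 1, 2, 3, ...) to
    # half-split order (0, 2, 4, ..., 1, 3, 5, ...).
    return w.view(n_heads, dim // n_heads // 2, 2, dim).transpose(1, 2).reshape(dim, dim)
```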
Ohhh, thank you. We are converting Megatron weights to FT weights, so we will check the weight shapes then.
Awesome, thanks for clarifying this!
Awesome, thanks for clarifying this!
Thanks for the detailed illustration!!!
Thank you @lytning98, your answer saved my life.
May I ask the purpose behind this process?
transformers/src/transformers/models/llama/modeling_llama.py, lines 177 to 181 in 6cdbd73
I mean, why not use the interleaved pairing as in Meta's official LLaMA?
@zphang @ArthurZucker.
Thanks in advance.
https://discuss.huggingface.co/t/is-llama-rotary-embedding-implementation-correct/44509/2
A few reasons that were already mentioned:
- first and foremost, and I can't stress this enough, the licence
- second, Eleuther's RoPE formulation (the one we are using) is equivalent, and it may have one less operation, which makes it slightly more optimised (see the sketch below for a quick numerical check of the equivalence)
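To illustrate the equivalence claimed in the second point, here is a minimal numerical sketch. The shapes, the helper names (rope_interleaved, rope_half), and applying the reordering to the activations rather than to the projection weights are all simplifications for illustration, not code from either repository:

```python
import torch

torch.manual_seed(0)
head_dim, seq_len, base = 8, 5, 10000.0
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)

# Shared rotation angles: theta_j = base^(-2j/d); the angle at position m is m * theta_j.
theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
angles = torch.outer(torch.arange(seq_len).float(), theta)  # (seq_len, head_dim/2)

def rope_interleaved(x):
    # Meta / GPT-J style: rotate consecutive pairs (x0, x1), (x2, x3), ...
    x_ = torch.view_as_complex(x.reshape(seq_len, -1, 2))
    return torch.view_as_real(x_ * torch.polar(torch.ones_like(angles), angles)).flatten(-2)

def rope_half(x):
    # HF / GPT-NeoX style: rotate the pairs (x_i, x_{i + head_dim/2}).
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2 :]
    return x * cos + torch.cat([-x2, x1], dim=-1) * sin

# The conversion script permutes the q/k projection weights; here the same
# reordering (interleaved -> half-split layout) is applied to the activations.
perm = torch.cat([torch.arange(0, head_dim, 2), torch.arange(1, head_dim, 2)])

scores_meta = rope_interleaved(q) @ rope_interleaved(k).T
scores_hf = rope_half(q[:, perm]) @ rope_half(k[:, perm]).T
print(torch.allclose(scores_meta, scores_hf, atol=1e-5))  # True: identical attention scores
```

Because both q and k receive the same reordering, the rotated pairs, and therefore the attention scores, come out identical; only the within-head ordering of the dimensions differs between the two checkpoint layouts.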