labmlai/annotated_deep_learning_paper_implementations

question about RoPE code

yukyeongmin opened this issue · 3 comments

x_rope = (x_rope * self.cos_cached[:x.shape[0]]) + (neg_half_x * self.sin_cached[:x.shape[0]])

self.cos_cached and self.sin_cached have the same shape as x, don't they?

So if this line is intended to compute RoPE on only part of x, i.e. x[..., :self.d],
I think this line should be
x_rope = (x_rope * self.cos_cached[..., :self.d]) + (neg_half_x * self.sin_cached[..., :self.d])

Please let me know if I'm wrong.
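
(For reference, here is a minimal self-contained sketch of where a line like the one quoted above might sit, assuming x has shape [seq_len, batch_size, n_heads, d_model] and a cache of shape [seq_len, 1, 1, d]; the class, shapes, and caching logic here are assumptions for illustration, not necessarily the repository's exact code.)

import torch
import torch.nn as nn

class RotaryPE(nn.Module):
    """Minimal RoPE sketch; shapes follow the discussion above, not necessarily the repo."""

    def __init__(self, d: int, base: float = 10_000.0):
        super().__init__()
        self.d = d          # number of features that get rotary embeddings
        self.base = base
        self.cos_cached = None
        self.sin_cached = None

    def _build_cache(self, x: torch.Tensor):
        # Rebuild only if the cache is missing or shorter than this sequence
        if self.cos_cached is not None and x.shape[0] <= self.cos_cached.shape[0]:
            return
        seq_len = x.shape[0]
        # theta_i = base^(-2i/d) for each feature pair
        theta = 1.0 / (self.base ** (torch.arange(0, self.d, 2, device=x.device).float() / self.d))
        positions = torch.arange(seq_len, device=x.device).float()
        angles = positions[:, None] * theta[None, :]          # [seq_len, d/2]
        angles = torch.cat([angles, angles], dim=-1)          # [seq_len, d]
        # [seq_len, 1, 1, d] so it broadcasts over batch and head dimensions
        self.cos_cached = angles.cos()[:, None, None, :]
        self.sin_cached = angles.sin()[:, None, None, :]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, batch_size, n_heads, d_model]
        self._build_cache(x)
        # Only the first self.d features are rotated; the rest pass through unchanged
        x_rope, x_pass = x[..., :self.d], x[..., self.d:]
        d_2 = self.d // 2
        neg_half_x = torch.cat([-x_rope[..., d_2:], x_rope[..., :d_2]], dim=-1)
        # The line from the issue: the cache is sliced by sequence length (dim 0)
        x_rope = (x_rope * self.cos_cached[:x.shape[0]]) + (neg_half_x * self.sin_cached[:x.shape[0]])
        return torch.cat((x_rope, x_pass), dim=-1)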

You are correct that self.cos_cached and self.sin_cached have the same shape as x.

And as for the modification, that also looks correct, because it would ensure that the rotary embeddings are applied only to the subset of features specified by self.d.

vpj commented

They have similar shapes. Slicing the cached sin/cos with [:x.shape[0]] truncates them along the sequence dimension, because the sequence length (number of tokens per sample) changes from batch to batch.
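
(To illustrate that point, a small sketch of the slicing, assuming the cache has shape [max_seq_len, 1, 1, d]; the shapes and names are assumptions for illustration.)

import torch

# Suppose the cache was built for the longest sequence seen so far.
max_seq_len, d = 512, 64
cos_cached = torch.randn(max_seq_len, 1, 1, d)   # assumed layout [max_seq_len, 1, 1, d]

# A shorter batch: [seq_len, batch_size, n_heads, d]
x = torch.randn(128, 8, 4, d)

# [:x.shape[0]] slices along dim 0 (positions), not the feature dimension,
# so the same cache serves any sequence length up to max_seq_len.
print(cos_cached[:x.shape[0]].shape)   # torch.Size([128, 1, 1, 64])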

Thanks for the reply!! @vpj @nagamonish

Didn't you have any problems running that code? The original code didn't work for me with inputs of a different shape, and I thought it was a problem with the indexing syntax.