lucidrains/x-transformers

Question: rotary embeddings and bad length extrapolation


In my tests, I've found that rotary embeddings don't length extrapolate well.
To be fair, you do mention this in your README.
You suggest setting `rotary_xpos = True`, which should fix this, but then the attention becomes local.
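
For reference, this is roughly how I'm enabling it, following the README's `TransformerWrapper` / `Decoder` example (a minimal sketch; the hyperparameters are just placeholders):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_xpos = True  # xpos-scaled rotary: extrapolates, but attention decays with distance (effectively local)
    )
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x)  # (1, 1024, 20000)
```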

Is this the best way to get good length extrapolation in a transformer, or is there a better positional embedding that doesn't suffer from this, yet still works with flash attention and key/value mems?
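
Concretely, the setup I'd like to keep working is flash attention plus XL-style memories, something like the sketch below (based on my reading of the README's Transformer-XL recurrence and flash attention examples; `attn_flash`, `max_mem_len` and `return_mems` are the knobs I believe are involved, the rest are placeholders):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 512,
    max_mem_len = 2048,        # carry key/value memories across segments, Transformer-XL style
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_flash = True      # flash attention
        # which positional scheme to add here is exactly my question
    )
)

seg1 = torch.randint(0, 20000, (1, 512))
seg2 = torch.randint(0, 20000, (1, 512))

logits1, mems1 = model(seg1, return_mems = True)
logits2, mems2 = model(seg2, mems = mems1, return_mems = True)  # second segment also attends over cached memories
```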

I'll try `rotary_xpos`, but I don't like the idea of shrinking the effective context from something potentially very large down to something small.

Thank you

Other candidates are ALiBi or no positional embedding at all (rough sketches of both below). For the latter to work, do you need to train on a range of sequence lengths so the network can learn to length extrapolate?
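
Here's how I understand the two alternatives would be configured (a minimal sketch based on my reading of the README; `alibi_pos_bias`, `alibi_num_heads` and `use_abs_pos_emb` are the flags I believe are relevant):

```python
from x_transformers import TransformerWrapper, Decoder

# ALiBi: a linear distance penalty on attention scores instead of rotating queries/keys
alibi_model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        alibi_pos_bias = True,  # turn on the ALiBi positional bias
        alibi_num_heads = 4     # I believe the README suggests biasing only a subset of heads
    )
)

# No positional embedding at all: rely on the causal mask alone
nope_model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    use_abs_pos_emb = False,    # disable the learned absolute positional embedding
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8
    )
)
```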