lucidrains/x-transformers

Question: rotary embeddings and bad length extrapolation


In my tests, I've found that rotary embeddings don't length extrapolate well.
To be fair, you do mention this in your README.
You suggest setting `rotary_xpos = True`, which should fix this, but then the attention becomes local.
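
For reference, this is roughly how I'm enabling it, following the README's `TransformerWrapper` / `Decoder` example (a minimal sketch; the hyperparameters are just placeholders):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_xpos = True  # xpos-scaled rotary: extrapolates, but attention decays with distance (effectively local)
    )
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x)  # (1, 1024, 20000)
```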

Is this the best way to get good length extrapolation in a transformer, or is there a better positional embedding that doesn't suffer from this, yet still works with flash attention and key/value mems?
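
Concretely, the setup I'd like to keep working is flash attention plus XL-style memories, something like the sketch below (based on my reading of the README's Transformer-XL recurrence and flash attention examples; `attn_flash`, `max_mem_len` and `return_mems` are the knobs I believe are involved, the rest are placeholders):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 512,
    max_mem_len = 2048,        # carry key/value memories across segments, Transformer-XL style
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_flash = True      # flash attention
        # which positional scheme to add here is exactly my question
    )
)

seg1 = torch.randint(0, 20000, (1, 512))
seg2 = torch.randint(0, 20000, (1, 512))

logits1, mems1 = model(seg1, return_mems = True)
logits2, mems2 = model(seg2, mems = mems1, return_mems = True)  # second segment also attends over cached memories
```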

I'll try `rotary_xpos`, but I don't like the idea of shrinking the effective context from something potentially very large down to something small.

Thank you

Other candidates are ALiBi or no positional embedding at all (rough sketches of both below). For the latter to work, do you need to train on a range of sequence lengths so the network can learn to length extrapolate?
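
Here's how I understand the two alternatives would be configured (a minimal sketch based on my reading of the README; `alibi_pos_bias`, `alibi_num_heads` and `use_abs_pos_emb` are the flags I believe are relevant):

```python
from x_transformers import TransformerWrapper, Decoder

# ALiBi: a linear distance penalty on attention scores instead of rotating queries/keys
alibi_model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        alibi_pos_bias = True,  # turn on the ALiBi positional bias
        alibi_num_heads = 4     # I believe the README suggests biasing only a subset of heads
    )
)

# No positional embedding at all: rely on the causal mask alone
nope_model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    use_abs_pos_emb = False,    # disable the learned absolute positional embedding
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8
    )
)
```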