Question: Sliding window attention
stellanhaglund opened this issue · 3 comments
Are there any plans to try out sliding window attention (like Mistral) in this repo, or is that more appropriate for a separate fork?
Also, if anyone has tried anything with this, I'm really interested in hearing about it.
The new flash-attention has sliding window attention built in; however, it doesn't work when compiling the model. So it is extremely easy to try as-is, but you will end up with slow training. There is another repo called TinyLlama where sliding window attention is an option, but my feeling is that it is slower than this repo with compile=True. It would be nice if they implemented it here, though. I agree with you.
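For reference, here is a minimal, untested sketch of the two options discussed: using flash-attn's built-in `window_size` argument (added around flash-attn 2.3), and a plain-PyTorch fallback with an explicit banded causal mask passed to `F.scaled_dot_product_attention`, which does compose with `torch.compile` at the cost of materializing a T x T mask. The function names and the exact window-boundary semantics here are illustrative assumptions, not code from this repo.

```python
import torch
import torch.nn.functional as F


def sliding_window_flash(q, k, v, window=4096):
    """Causal sliding-window attention via flash-attn's built-in support.
    q, k, v: (batch, seqlen, n_heads, head_dim).
    Assumes flash-attn >= 2.3; may not play well with torch.compile."""
    from flash_attn import flash_attn_func
    # window_size=(left, right): each query attends to at most `window`
    # tokens to its left and none to its right.
    return flash_attn_func(q, k, v, causal=True, window_size=(window, 0))


def sliding_window_sdpa(q, k, v, window=4096):
    """Fallback using an explicit banded causal mask with PyTorch SDPA,
    which works under torch.compile but builds a full (T, T) boolean mask.
    q, k, v: (batch, n_heads, seqlen, head_dim)."""
    T = q.size(-2)
    idx = torch.arange(T, device=q.device)
    # True where key j is visible from query i: j <= i and i - j <= window.
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] <= window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Dropping the mask into the existing attention module (in place of the `is_causal=True` path) would be the least invasive way to experiment, though the memory cost of the dense mask grows quadratically with context length.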
I'm not only interested in the performance side; I'm also interested in whether there's any noticeable difference in the output with sliding window attention.
It seems to benefit Mistral a lot.
Mistral is more about data secret sauce than architecture changes; it may only be slightly better.