Can RWKV beat Flash Attention?

Question


yxchng opened this issue 2 months ago · 1 comment

I have been experimenting with RWKV v4 and v4neo, but both use much more memory (about 2x) than my LM that uses Flash Attention. I'm not sure what I am doing wrong. Is this expected?

Answer 1 · 2024-04-16T09:52:43.000Z

Try v5 first. What are your model size, batch size (bsz), and context length (ctxlen)?
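To make the comparison concrete, it may help to report peak GPU memory for both models under the same model size, bsz, and ctxlen. The following is only a rough sketch, assuming PyTorch on a CUDA device: `RunConfig`, `peak_memory_mib`, `report`, and `step_fn` are hypothetical names for illustration, not part of the RWKV or Flash Attention code. The `torch.cuda` calls are standard PyTorch memory statistics.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RunConfig:
    """Settings to keep identical across both runs (hypothetical helper)."""
    n_params: int  # model size, in parameters
    bsz: int       # batch size
    ctxlen: int    # context length, in tokens

    @property
    def tokens_per_step(self) -> int:
        # Number of tokens processed in one step.
        return self.bsz * self.ctxlen


def peak_memory_mib(step_fn: Callable[[], None]) -> float:
    """Run one step and return the peak allocated GPU memory in MiB."""
    import torch  # imported lazily so RunConfig works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("a CUDA device is required for this measurement")
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    step_fn()  # assumed to do forward + backward (+ optimizer step)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20


def report(name: str, cfg: RunConfig, peak_mib: float) -> str:
    """Format one line that can be pasted into the issue."""
    return (f"{name}: params={cfg.n_params:,} bsz={cfg.bsz} "
            f"ctxlen={cfg.ctxlen} tokens/step={cfg.tokens_per_step} "
            f"peak={peak_mib:.0f} MiB")
```

Calling `peak_memory_mib` once per model with the same `RunConfig`, and the same precision and optimizer, should show whether the 2x gap holds when the settings match. Note that `max_memory_allocated` counts tensors allocated through PyTorch's caching allocator, so it can be lower than what `nvidia-smi` reports.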