Flash attention and attention mask modification. Does the model support flash attention?
Closed this issue · 1 comment
denadai2 commented
Dear authors,
first of all congrats for your idea and paper!!
I have a question about the code. I see here
that in flash attention you do not modify the attention mask. Is this expected? Thanks!
Yxxxb commented
Hi,
Thank you for your interest.
Since we need a 4D attention mask, but the open-source flash attention implementation only supports a 2D causal attention mask, we chose the standard SDPA path and modified the attention mask on top of that.
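To illustrate the distinction (this is not the repo's code, just a minimal NumPy sketch): SDPA-style attention accepts an arbitrary additive mask of shape `(batch, heads, seq, seq)`, so each head can have its own masking pattern, whereas flash attention's causal flag only expresses the single lower-triangular pattern shared by all heads. The function name `sdpa` and the shapes below are illustrative assumptions.

```python
import numpy as np

def sdpa(q, k, v, attn_mask):
    """Scaled dot-product attention with an additive 4D mask.

    q, k, v:    (batch, heads, seq, dim)
    attn_mask:  (batch, heads, seq, seq), additive; -inf blocks a position.
    """
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    scores = scores + attn_mask                      # per-head 4D mask applied here
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                         # exp(-inf) -> 0: masked out
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 2, 3, 4))
k = rng.normal(size=(1, 2, 3, 4))
v = rng.normal(size=(1, 2, 3, 4))

# Causal pattern (-inf strictly above the diagonal), broadcast to 4D;
# a real 4D mask could instead differ per head, which flash attention's
# 2D causal flag cannot express.
causal = np.triu(np.full((3, 3), -np.inf), k=1)
mask = np.broadcast_to(causal, (1, 2, 3, 3))

out = sdpa(q, k, v, mask)
# Token 0 can only attend to itself under the causal mask,
# so its output equals its own value vector.
```

This mirrors the `attn_mask` argument of PyTorch's `scaled_dot_product_attention`, which is why falling back to SDPA makes a custom 4D mask straightforward.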