Dao-AILab/flash-attention

Understanding sync and arrive in FA3 Store Function

ziyuhuang123 opened this issue · 0 comments

Does anyone understand why the thread count passed to the named barrier in the FA3 store function (cutlass::arch::NamedBarrier::arrive(NumEpilogueThreads + cutlass::NumThreadsPerWarp, ...) is 256 + 32 (within a single warp, with all threads arriving)?

Also, why does the non-varlen case require an arrive before the sync, while the varlen case only needs a sync without an arrive?
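For reference, the arrive/sync pattern being asked about can be sketched as a standalone kernel. This is an illustrative sketch, not the FA3 source: the constants, the barrier id, and the helper names `named_barrier_arrive` / `named_barrier_sync` are assumptions made for the example. It relies on the PTX rule that a named barrier completes once the combined number of `bar.arrive` and `bar.sync` arrivals reaches the count given to both instructions. One plausible reading of the 256 + 32 is then: 256 threads arrive without blocking, and one warp of 32 threads additionally syncs (and blocks) before it reads shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// All names and values below are assumptions for this sketch.
constexpr int kNumEpilogueThreads = 256;  // the "256" from the question
constexpr int kWarpSize = 32;             // the "32" from the question
constexpr int kBarrierId = 1;             // hypothetical id; 0 is used by __syncthreads()
constexpr int kBarrierCount = kNumEpilogueThreads + kWarpSize;  // 288

// Non-blocking: registers the calling warp's arrival and continues.
__device__ void named_barrier_arrive(int id, int count) {
    asm volatile("bar.arrive %0, %1;" :: "r"(id), "r"(count) : "memory");
}

// Blocking: registers arrival and waits until `count` threads have arrived.
__device__ void named_barrier_sync(int id, int count) {
    asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(count) : "memory");
}

__global__ void store_sketch(int* out) {
    __shared__ int smem[kNumEpilogueThreads];
    int tid = threadIdx.x;

    smem[tid] = tid;          // every thread writes its piece to shared memory
    __threadfence_block();    // make the write visible within the block
    named_barrier_arrive(kBarrierId, kBarrierCount);  // 256 arrivals in total

    // Only the last warp waits; its 32 sync arrivals bring the total to 288.
    int warp = tid / kWarpSize;
    if (warp == kNumEpilogueThreads / kWarpSize - 1) {
        named_barrier_sync(kBarrierId, kBarrierCount);
        if (tid % kWarpSize == 0) {  // one thread stands in for the store
            int sum = 0;
            for (int i = 0; i < kNumEpilogueThreads; ++i) sum += smem[i];
            *out = sum;
        }
    }
}

int main() {
    int* d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    store_sketch<<<1, kNumEpilogueThreads>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d (expected %d)\n", h_out, 255 * 256 / 2);
    cudaFree(d_out);
    return 0;
}
```

If a code path instead had every participating thread call the blocking sync and nothing else, no separate arrive would be needed, and the count would be just the number of syncing threads. Whether that is what the varlen path does is not confirmed here.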