Dao-AILab/flash-attention

Question about warp_scheduler_barrier_arrive in FA3 and cutlass::arch::NamedBarrier::arrive Usage

ziyuhuang123 opened this issue · 2 comments

Could someone please clarify the usage of warp_scheduler_barrier_arrive in FA3? Specifically, in the following line:

cutlass::arch::NamedBarrier::arrive(NumMmaThreads, /*id*/);

Is it correct to use NumMmaThreads here? My concern is that a single thread calling arrive increases the barrier counter by NumMmaThreads, while the sync function waits for exactly NumMmaThreads. Wouldn't it be more accurate to write it as:

cutlass::arch::NamedBarrier::arrive(1, /*id*/);

This seems to reflect the intended behavior more appropriately. Could someone confirm?

No NamedBarrier works differently. You can see how Cutlass uses it. Or the doc on bar.sync from PTX.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar-barrier

Thanks! I get it. The threadcount in bar.arrive just means how many threads should participate, but not every time a thread add to the barrier.