Question about warp_scheduler_barrier_arrive in FA3 and cutlass::arch::NamedBarrier::arrive Usage
ziyuhuang123 opened this issue · 2 comments
Could someone please clarify the usage of warp_scheduler_barrier_arrive in FA3? Specifically, in the following line:
cutlass::arch::NamedBarrier::arrive(NumMmaThreads, /*id*/);
Is it correct to use NumMmaThreads here? My concern is that a single thread calling arrive increases the barrier counter by NumMmaThreads, while the sync function waits for exactly NumMmaThreads. Wouldn't it be more accurate to write it as:
cutlass::arch::NamedBarrier::arrive(1, /*id*/);
This seems to reflect the intended behavior more appropriately. Could someone confirm?
No NamedBarrier works differently. You can see how Cutlass uses it. Or the doc on bar.sync
from PTX.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar-barrier
Thanks! I get it. The threadcount in bar.arrive just means how many threads should participate, but not every time a thread add to the barrier.