Question about warp_scheduler_barrier_arrive in FA3 and cutlass::arch::NamedBarrier::arrive Usage

Question

Question about warp_scheduler_barrier_arrive in FA3 and cutlass::arch::NamedBarrier::arrive Usage

ziyuhuang123 opened this issue a month ago · 2 comments

Could someone please clarify the usage of warp_scheduler_barrier_arrive in FA3? Specifically, in the following line:

cutlass::arch::NamedBarrier::arrive(NumMmaThreads, /*id*/);

Is it correct to use NumMmaThreads here? My concern is that a single thread calling arrive increases the barrier counter by NumMmaThreads, while the sync function waits for exactly NumMmaThreads. Wouldn't it be more accurate to write it as:

cutlass::arch::NamedBarrier::arrive(1, /*id*/);

This seems to reflect the intended behavior more appropriately. Could someone confirm?

Answer 1 · 2024-12-06T08:44:48.000Z

No NamedBarrier works differently. You can see how Cutlass uses it. Or the doc on bar.sync from PTX.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar-barrier

Answer 2 · 2024-12-06T09:07:24.000Z

Thanks! I get it. The threadcount in bar.arrive just means how many threads should participate, but not every time a thread add to the barrier.