Dao-AILab/flash-attention

Understanding the Role of arrive in NamedBarrier Synchronization

ziyuhuang123 opened this issue · 1 comment

  1. In the FA3 store function, I observed the following process:

    • Data is stored from registers to shared memory.
    • A sync occurs.
    • Then, data is stored from shared memory to global memory.
  2. This sync is a NamedBarrier sync, but I noticed that no arrive operation is performed:

    • I searched for the corresponding barrier ID and confirmed that no arrive is associated with it.
  3. This reminds me of __syncthreads, which translates to PTX as bar.sync and also doesn’t involve an explicit arrive.

  4. This raises the question:

    • Does this imply that arrive is unnecessary for such synchronization scenarios?
  5. However, I noticed that in other parts of FA3, arrive is used.

  6. Therefore, I’m curious:

    • What are the specific conditions or scenarios where arrive is required?
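
For reference, the store pattern from item 1 can be sketched as below. This is only a minimal illustration, not FA3's actual code: the helper `named_barrier_sync`, the barrier ID `kEpilogueBarrier`, the group size of 128 threads, and the kernel itself are all assumptions made for the example. The only real instruction used is PTX `bar.sync id, count`. Every thread that writes shared memory also reads it afterwards, so every participant must wait, and each one calls `bar.sync`; nothing in this pattern calls for a separate arrive.

```cuda
#include <cuda_runtime.h>

// Thin wrapper over PTX `bar.sync id, count`: the calling thread arrives at
// named barrier `id` and blocks until `count` threads have arrived.
__device__ __forceinline__ void named_barrier_sync(int id, int count) {
    asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(count) : "memory");
}

constexpr int kGroupThreads = 128;    // assumed size of the storing group
constexpr int kEpilogueBarrier = 1;   // hypothetical ID; __syncthreads uses barrier 0

__global__ void store_kernel(float* out) {
    __shared__ float smem[kGroupThreads];
    const int t = threadIdx.x;
    if (t < kGroupThreads) {                    // warp-aligned subset of the block
        const float reg = static_cast<float>(t);  // stand-in for a register result
        smem[kGroupThreads - 1 - t] = reg;      // registers -> shared memory
        // All 128 threads both wrote and will read, so all of them sync.
        named_barrier_sync(kEpilogueBarrier, kGroupThreads);
        out[t] = smem[t];                       // shared memory -> global memory
    }
    // Threads t >= kGroupThreads never touch this barrier.
}

// Host-side check: launches 256 threads, of which only the first 128 store.
bool run_store_demo() {
    float* d_out = nullptr;
    float h_out[kGroupThreads];
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    store_kernel<<<1, 256>>>(d_out);
    cudaError_t err =
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < kGroupThreads; ++i) {
        // Thread i reads the slot written by thread (127 - i).
        if (h_out[i] != static_cast<float>(kGroupThreads - 1 - i)) return false;
    }
    return true;
}
```

Here the named barrier behaves like a `__syncthreads` restricted to 128 threads, which is why the rest of the block can keep running without participating.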

It seems that when the barrier is used like __syncthreads, there's no need for arrive. bar.arrive is meant for use in WASP, isn't it? Still, using a barrier for WASP feels odd... doesn't this force the producer and consumer to have the same participating threads?
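
A small sketch of the producer/consumer case may help with that last question. It is based on my reading of the PTX documentation for `bar.arrive` and `bar.sync`, not on FA3's code, and all names (`bar_arrive`, `bar_sync`, `kDataReady`, the 128/128 split) are hypothetical. As far as I understand, the thread count passed to the barrier is the total of arriving plus waiting threads, so the producers that call `bar.arrive` and the consumers that call `bar.sync` can be different threads. The `__threadfence_block()` before the arrive is a conservative choice on my part and may not be required.

```cuda
#include <cuda_runtime.h>

// PTX `bar.sync id, count`: arrive at the barrier and wait for completion.
__device__ __forceinline__ void bar_sync(int id, int count) {
    asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(count) : "memory");
}

// PTX `bar.arrive id, count`: signal arrival and continue without waiting.
__device__ __forceinline__ void bar_arrive(int id, int count) {
    asm volatile("bar.arrive %0, %1;" :: "r"(id), "r"(count) : "memory");
}

constexpr int kProducerThreads = 128;  // assumed producer group (warps 0-3)
constexpr int kConsumerThreads = 128;  // assumed consumer group (warps 4-7)
constexpr int kDataReady = 2;          // hypothetical named-barrier ID

__global__ void producer_consumer_kernel(int* out) {
    __shared__ int smem[kProducerThreads];
    const int t = threadIdx.x;
    // The count covers both groups: 128 arrivals + 128 waiters.
    const int count = kProducerThreads + kConsumerThreads;
    if (t < kProducerThreads) {
        smem[t] = 2 * t;               // produce a value
        __threadfence_block();         // conservative: publish the write first
        bar_arrive(kDataReady, count); // signal and keep going, no wait
    } else {
        bar_sync(kDataReady, count);   // wait until all producers have arrived
        const int i = t - kProducerThreads;
        out[i] = smem[i];              // consume
    }
}

// Host-side check that every consumer saw its producer's value.
bool run_producer_consumer_demo() {
    int* d_out = nullptr;
    int h_out[kConsumerThreads];
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    producer_consumer_kernel<<<1, kProducerThreads + kConsumerThreads>>>(d_out);
    cudaError_t err =
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < kConsumerThreads; ++i) {
        if (h_out[i] != 2 * i) return false;
    }
    return true;
}
```

In this sketch the producers never block on the barrier, which is the difference from the `__syncthreads`-style use: arrive is only needed when one side should signal without waiting.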