fzh0917/SparseTT

About Sparse MHSA


How does "sparse multi-head self-attention" compare to the original multi-head self-attention in terms of computational complexity, parameters, and GFLOPs?

The theoretical computational complexity of sparse multi-head self-attention (SMSA) and naive multi-head self-attention (MSA) is the same, and so is the number of parameters. In fact, MSA is a special case of SMSA.
However, the FLOPs of SMSA are significantly lower than those of MSA. In this work we did not compute the exact values; if you need them, I recommend using thop or ptflops to measure them.
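For reference, a minimal sketch of counting MACs/FLOPs with thop is shown below. The nn.Sequential model and the tensor sizes are placeholders rather than the actual SparseTT network, and modules whose forward relies on raw tensor ops (such as torch.matmul inside an attention block) are only counted if a counter is registered via the `custom_ops` argument.

```python
# Minimal FLOP-counting sketch with thop. The model below is only a
# placeholder; the actual SparseTT / SMSA module would be passed instead.
import torch
import torch.nn as nn
from thop import profile

model = nn.Sequential(          # placeholder network, not the SparseTT model
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
)
dummy_input = torch.randn(1, 400, 256)  # (batch, tokens, channels), illustrative sizes

# thop reports multiply-accumulate operations (MACs) and the parameter count;
# FLOPs are commonly approximated as 2 * MACs. Modules whose forward uses raw
# tensor ops (e.g. torch.matmul in an attention block) are not counted unless
# a counting function is supplied through the `custom_ops` argument.
macs, params = profile(model, inputs=(dummy_input,))
print(f"MACs: {macs / 1e9:.4f} G | Params: {params / 1e6:.4f} M")
```

ptflops works similarly via `get_model_complexity_info`, which takes the model and an input shape instead of a dummy tensor.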

Thank you for your reply!