fzh0917/SparseTT

About Sparse MHSA


How does "sparse multi-head self-attention" compare to the original multi-head self-attention in terms of computational complexity, parameters, and GFLOPs?

The theoretical computational complexity of sparse multi-head self-attention (SMSA) and naive multi-head self-attention (MSA) is the same, and so is the number of parameters. In fact, MSA is a special case of SMSA.
However, the FLOPs of SMSA are significantly lower than those of MSA. In this work we did not compute the exact values; if you need them, I recommend using thop or ptflops to measure them.
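For reference, a minimal sketch of counting MACs/FLOPs with thop is shown below. The nn.Sequential model and the tensor sizes are placeholders rather than the actual SparseTT network, and modules whose forward relies on raw tensor ops (such as torch.matmul inside an attention block) are only counted if a counter is registered via the `custom_ops` argument.

```python
# Minimal FLOP-counting sketch with thop. The model below is only a
# placeholder; the actual SparseTT / SMSA module would be passed instead.
import torch
import torch.nn as nn
from thop import profile

model = nn.Sequential(          # placeholder network, not the SparseTT model
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
)
dummy_input = torch.randn(1, 400, 256)  # (batch, tokens, channels), illustrative sizes

# thop reports multiply-accumulate operations (MACs) and the parameter count;
# FLOPs are commonly approximated as 2 * MACs. Modules whose forward uses raw
# tensor ops (e.g. torch.matmul in an attention block) are not counted unless
# a counting function is supplied through the `custom_ops` argument.
macs, params = profile(model, inputs=(dummy_input,))
print(f"MACs: {macs / 1e9:.4f} G | Params: {params / 1e6:.4f} M")
```

ptflops works similarly via `get_model_complexity_info`, which takes the model and an input shape instead of a dummy tensor.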

Thank you for your reply!