About Sparse MHSA
Closed this issue · 2 comments
ZiMo-Chen commented
How does "sparse multi-head self-attention" compare to the original multi-head self-attention in terms of computational complexity, number of parameters, and GFLOPs?
fzh0917 commented
The theoretical computational complexity of sparse multi-head self-attention (SMSA) and naive multi-head self-attention (MSA) is the same, and so is the number of parameters. In fact, MSA is a special case of SMSA.
However, the FLOPs of SMSA are significantly smaller than those of MSA. In this work, we did not report the exact values. If you want them, I recommend using thop or ptflops to compute them.
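For reference, a minimal sketch of how such a measurement could look with thop, using PyTorch's `nn.MultiheadAttention` as a stand-in for naive MSA (this is not code from the repository; to compare, profile the SMSA module from this repo the same way):

```python
# Sketch only: measuring FLOPs/params of a standard multi-head self-attention
# layer with thop. Note that thop counts only modules it recognizes (e.g. Linear),
# so purely functional ops inside attention may require custom_ops hooks.
import torch
import torch.nn as nn
from thop import profile

msa = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(1, 196, 256)  # (batch, tokens, embed_dim), e.g. a 14x14 token grid

# MultiheadAttention takes (query, key, value); self-attention uses x for all three.
flops, params = profile(msa, inputs=(x, x, x))
print(f"FLOPs: {flops / 1e9:.3f} G  |  Params: {params / 1e6:.3f} M")
```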
ZiMo-Chen commented
Thank you for your reply!