question about N-Gram WSA and shifted WSA
Thanks for your great work! I have a question about N-Gram WSA and shifted WSA:
NSTB consists of N-Gram WSA and shifted WSA. Since shifted WSA already establishes interactions across windows, why use N-Gram WSA to build relationships between local windows again, instead of plain WSA?
As I showed in Fig. 2 of the paper, plain WSA (w/o N-Gram) can see only a small 8x8 local window. Although the shifted WSA of Swin Transformer does enable cross-window interaction across blocks, when self-attention is computed in a given block, plain WSA cannot see anything outside its own window. I found that this limitation leads to degraded performance.
So I proposed N-Gram WSA. As specified in the paper, the Uni-Gram embedding allows the windows adjacent to a given local window (the neighbors denoted X1, the window itself X0) to be computed together with X0 by self-attention. The "w/ N-Gram" images in Fig. 2 show that even when self-attention is computed within X0, information from X1 can be involved.
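For intuition, here is a minimal PyTorch sketch of that idea. It is not the implementation in this repo: the average-pooled uni-gram context, the projection-free attention, and the names `ngram_wsa`, `window_size`, and `ngram` are all simplifying assumptions for illustration. The point it demonstrates is that neighboring-window (X1) information is mixed into each window before plain WSA, so attention inside X0 also sees X1.

```python
import torch
import torch.nn.functional as F


def ngram_wsa(x: torch.Tensor, window_size: int = 8, ngram: int = 2) -> torch.Tensor:
    """Simplified N-Gram WSA sketch (illustrative, not the paper's exact method).

    x: (B, C, H, W) feature map, H and W divisible by window_size.
    Mixes neighboring-window (X1) context into each window (X0)
    before plain window self-attention.
    """
    B, C, H, W = x.shape
    nH, nW = H // window_size, W // window_size

    # Uni-gram-style context: summarize each window by its average ...
    win_avg = F.avg_pool2d(x, window_size)                # (B, C, nH, nW)
    # ... then mix each window summary with its `ngram` neighboring windows.
    pad = ngram // 2
    ctx = F.pad(win_avg, (pad, pad, pad, pad), mode='replicate')
    ctx = F.avg_pool2d(ctx, ngram, stride=1)[..., :nH, :nW]
    # Broadcast the mixed context back to pixel resolution and add it,
    # so every window now carries information from its adjacent windows.
    x = x + F.interpolate(ctx, scale_factor=window_size, mode='nearest')

    # Plain WSA on the context-enriched windows (no QKV/heads, for brevity).
    ws = window_size
    win = x.unfold(2, ws, ws).unfold(3, ws, ws)           # (B, C, nH, nW, ws, ws)
    win = win.permute(0, 2, 3, 4, 5, 1).reshape(-1, ws * ws, C)
    attn = torch.softmax(win @ win.transpose(1, 2) / C ** 0.5, dim=-1)
    return attn @ win                                     # (B*nH*nW, ws*ws, C)


# Example: out = ngram_wsa(torch.randn(1, 64, 32, 32))
```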
If you have any follow-up questions, please let me know.
Thanks.
Thank you for your answer!