Grid Partition
DDxk369 opened this issue · 1 comment
Thanks to the authors for open-sourcing MaxViT. I have really enjoyed this project and have been studying it for a few days. However, I don't understand how the Grid Partition works. To me it looks almost the same as Swin V1, so how does it accomplish the grid operation shown below? Looking forward to your advice, thank you.
Hi @DDxk369,
The grid partition is somewhat similar to applying a strided convolution: attention is performed not among directly neighboring pixels but among pixels sampled at a fixed stride across the feature map. For more details, I would refer to section 3.2 (Multi-axis Attention) of the original paper or this function. Regarding the similarity to the Swin Transformer: both use standard window attention, but MaxViT combines it with strided (grid) attention, whereas Swin combines it with shifted-window attention.
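To make the difference concrete, here is a minimal sketch in plain Python that only tracks which pixel coordinates end up attending to each other. It is not the repository's implementation; the function names `window_partition` and `grid_partition` and the 8x8 map with size 4 are assumptions chosen for illustration, and the height and width are assumed to be divisible by the size.

```python
def window_partition(height, width, size):
    """Group coordinates into non-overlapping size x size blocks of neighbours."""
    groups = {}
    for row in range(height):
        for col in range(width):
            # Pixels share a group when they fall into the same local block.
            key = (row // size, col // size)
            groups.setdefault(key, []).append((row, col))
    return list(groups.values())


def grid_partition(height, width, size):
    """Group coordinates into size x size sets sampled at a fixed stride."""
    # Assumes height and width are divisible by size (illustrative only).
    stride_r, stride_c = height // size, width // size
    groups = {}
    for row in range(height):
        for col in range(width):
            # Pixels share a group when they have the same offset modulo
            # the stride, so group members are spread over the whole map.
            key = (row % stride_r, col % stride_c)
            groups.setdefault(key, []).append((row, col))
    return list(groups.values())


blocks = window_partition(8, 8, 4)
grids = grid_partition(8, 8, 4)
print(blocks[0][:4])  # [(0, 0), (0, 1), (0, 2), (0, 3)]
print(grids[0][:4])   # [(0, 0), (0, 2), (0, 4), (0, 6)]
```

Both partitions yield four groups of 16 tokens here, so the attention cost is the same. Only the membership differs: a window group covers one local 4x4 block, while a grid group takes every second pixel and therefore spans the entire 8x8 map.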
Cheers,
Christoph