Swin-UNet is a variant of the widely used U-Net architecture that combines the windowed self-attention mechanism of the Swin Transformer with the U-Net encoder-decoder framework.
The Swin Transformer builds on the Vision Transformer by computing self-attention within local windows and using shifted windows to provide connections between neighboring windows, which significantly enhances the modeling power of the architecture.
Restricting attention to local windows gives the architecture computational complexity that is linear in the input image size, in contrast to the quadratic complexity of the Vision Transformer's global attention.
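The linear-versus-quadratic difference can be made concrete with the complexity formulas reported in the Swin Transformer paper: global multi-head self-attention (MSA) over an h x w map with C channels costs roughly 4hwC^2 + 2(hw)^2 C operations, while windowed MSA with window size M costs 4hwC^2 + 2M^2 hwC. The sketch below (channel count and resolutions are illustrative choices, not values from this text) computes the ratio as resolution grows:

```python
# Rough operation counts for global vs. windowed self-attention.
# Formulas follow the Swin Transformer paper; the concrete h, w, C, M
# values here are illustrative assumptions.

def msa_ops(h, w, C):
    """Global MSA: quadratic in the number of tokens h*w."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_ops(h, w, C, M=7):
    """Windowed MSA: linear in h*w, since attention is confined to M x M windows."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

for side in (56, 112, 224):
    ratio = msa_ops(side, side, 96) / wmsa_ops(side, side, 96)
    print(f"{side}x{side} feature map: global / windowed cost = {ratio:.1f}x")
```

Note that the windowed cost scales exactly with the token count: doubling both spatial sides quadruples the cost, whereas the global variant's dominant term grows sixteenfold.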
Stacking Swin Transformer blocks produces hierarchical feature maps by merging image patches in deeper layers. Shifting the window between consecutive transformer blocks allows cross-window connections, enabling the learning of finer details required in dense prediction tasks such as object detection and semantic segmentation.
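The window partitioning and the shift between consecutive blocks can be sketched in a few lines of NumPy. This is a minimal illustration, not the Swin Transformer implementation: the feature-map size, window size, and the cyclic shift via `np.roll` (the paper's efficient batch-computation trick, here shown without the accompanying attention mask) are simplifying assumptions.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping M x M windows.
    Returns an array of shape (num_windows, M, M, C)."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M, M, C)

def shifted_window_partition(x, M):
    """Cyclically shift the map by M // 2 before partitioning, so the new
    windows straddle the borders of the previous ones (cross-window mixing).
    A full implementation would also mask attention across the wrapped edges."""
    shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
    return window_partition(shifted, M)

# Toy 8x8 single-channel feature map, window size 4.
x = np.arange(64, dtype=np.float32).reshape(8, 8, 1)
regular = window_partition(x, 4)          # 4 windows of 4x4 tokens
shifted = shifted_window_partition(x, 4)  # windows offset by 2 in each direction
print(regular.shape, shifted.shape)
```

Within each window, standard multi-head self-attention is applied; alternating regular and shifted partitions across consecutive blocks is what propagates information between windows.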