
Transformer-Based Visual Segmentation: A Survey

arXiv, 2023
Xiangtai Li · Henghui Ding · Wenwei Zhang · Haobo Yuan · Guangliang Cheng
Jiangmiao Pang · Kai Chen · Ziwei Liu · Chen Change Loy

arXiv PDF S-Lab Project Page


This repo records, tracks, and benchmarks recent transformer-based visual segmentation methods, as a supplement to our survey.
If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.

🔥Highlight!!

[1] Previous transformer surveys divide methods by task and setting. In contrast, we revisit and group the existing transformer-based methods from a technical perspective.

[2] We survey the methods in two parts: one for the mainstream tasks based on a DETR-like meta-architecture, the other for related directions grouped by task.

[3] We further re-benchmark several representative works on image semantic segmentation and panoptic segmentation datasets.

[4] We also include query-based detection transformers, since both segmentation and detection tasks are unified by object queries.

Introduction

We present the first detailed survey on transformer-based segmentation.


Summary of Contents

Methods: A Survey

Meta-Architecture

Year Venue Acronym Paper Title Code/Project
2020 ECCV DETR End-to-End Object Detection with Transformers Code
2021 ICLR Deformable DETR Deformable DETR: Deformable Transformers for End-to-End Object Detection Code
2021 CVPR MaX-DeepLab MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers Code
2021 NeurIPS MaskFormer MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation Code
2021 NeurIPS K-Net K-Net: Towards Unified Image Segmentation Code
2023 CVPR Lite-DETR Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR Code
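
As a rough illustration of the object-query idea shared by the mask-transformer entries above, the sketch below turns each query embedding into a soft mask by taking its dot product with every per-pixel feature and applying a sigmoid. It is a toy, dependency-free example and is not taken from any listed repository: the function name `query_masks`, the fixed query values, and the 2x2 feature map are all assumptions made for illustration. In real models the queries are learned and refined by a transformer decoder.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def query_masks(queries, pixel_features):
    """Return one soft mask (H x W grid of probabilities) per query.

    queries: list of N embedding vectors, each of length C.
    pixel_features: H x W grid of embedding vectors, each of length C.
    (Hypothetical helper, written only for this illustration.)
    """
    masks = []
    for q in queries:
        mask = [
            [sigmoid(sum(qc * fc for qc, fc in zip(q, feat))) for feat in row]
            for row in pixel_features
        ]
        masks.append(mask)
    return masks


# Toy input: 2 queries, a 2x2 feature map, embedding size 2.
queries = [[4.0, -4.0], [-4.0, 4.0]]
pixel_features = [
    [[1.0, 0.0], [1.0, 0.0]],  # top row resembles query 0
    [[0.0, 1.0], [0.0, 1.0]],  # bottom row resembles query 1
]

masks = query_masks(queries, pixel_features)
# Threshold the soft masks at 0.5 to obtain binary masks.
binary = [[[p > 0.5 for p in row] for row in m] for m in masks]
```

In this toy setup, query 0 selects the top row and query 1 selects the bottom row. A detection head would instead regress a box from each query, which is one way to read the statement that both tasks are unified by object queries.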

Strong Representation

Better ViTs Design

Year Venue Acronym Paper Title Code/Project
2021 CVPR SETR Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers Code
2021 ICCV MViT-V1 Multiscale Vision Transformers Code
2022 CVPR MViT-V2 MViTv2: Improved Multiscale Vision Transformers for Classification and Detection Code
2021 NeurIPS XCiT XCiT: Cross-Covariance Image Transformers Code
2021 ICCV Pyramid ViT Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions Code
2021 ICCV CrossViT CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification Code
2021 ICCV CoaT Co-Scale Conv-Attentional Image Transformers Code
2022 CVPR MPViT MPViT: Multi-Path Vision Transformer for Dense Prediction Code
2022 NeurIPS SegViT SegViT: Semantic Segmentation with Plain Vision Transformers Code
2022 arXiv RSSeg Representation Separation for Semantic Segmentation with Vision Transformers N/A

Hybrid CNNs/Transformers/MLPs

Year Venue Acronym Paper Title Code/Project
2021 ICCV Swin Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Code
2022 CVPR Swin-v2 Swin Transformer V2: Scaling Up Capacity and Resolution Code
2021 NeurIPS SegFormer SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Code
2022 CVPR CMT CMT: Convolutional Neural Networks Meet Vision Transformers Code
2021 NeurIPS Twins Twins: Revisiting the Design of Spatial Attention in Vision Transformers Code
2021 ICCV CvT CvT: Introducing Convolutions to Vision Transformers Code
2021 NeurIPS ViTAE ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias Code
2022 CVPR ConvNeXt A ConvNet for the 2020s Code
2022 NeurIPS SegNeXt SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation Code
2022 CVPR PoolFormer PoolFormer: MetaFormer Is Actually What You Need for Vision Code
2022 arXiv STM Demystify Transformers & Convolutions in Modern Image Deep Networks Code

Self-Supervised Learning

Year Venue Acronym Paper Title Code/Project
2021 ICCV MoCo v3 An Empirical Study of Training Self-Supervised Vision Transformers Code
2022 ICLR BEiT BEiT: BERT Pre-Training of Image Transformers Code
2022 CVPR MaskFeat Masked Feature Prediction for Self-Supervised Visual Pre-Training Code
2022 CVPR MAE Masked Autoencoders Are Scalable Vision Learners Code
2022 NeurIPS ConvMAE MCMAE: Masked Convolution Meets Masked Autoencoders Code
2023 ICLR Spark SparK: the first successful BERT/MAE-style pretraining on any convolutional networks Code
2022 CVPR FLIP Scaling Language-Image Pre-training via Masking Code
2023 arXiv ConvNeXt V2 ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code

Interaction Design in Decoder

Improved Cross Attention Design

Year Venue Acronym Paper Title Code/Project
2021 CVPR Sparse R-CNN Sparse R-CNN: End-to-End Object Detection with Learnable Proposals Code
2022 CVPR AdaMixer AdaMixer: A Fast-Converging Query-Based Object Detector Code
2021 CVPR MaX-DeepLab MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers Code
2021 NeurIPS K-Net K-Net: Towards Unified Image Segmentation Code
2022 CVPR Mask2Former Masked-attention Mask Transformer for Universal Image Segmentation Code
2022 ECCV kMaX-DeepLab k-means Mask Transformer Code
2021 ICCV QueryInst Instances as Queries Code
2021 arXiv ISTR ISTR: End-to-End Instance Segmentation via Transformers Code
2021 NeurIPS SOLQ SOLQ: Segmenting Objects by Learning Queries Code
2022 CVPR Panoptic SegFormer Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers Code
2022 CVPR CMT-DeepLab CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation N/A
2022 CVPR SparseInst Sparse Instance Activation for Real-Time Instance Segmentation Code
2022 CVPR SAM-DETR Accelerating DETR Convergence via Semantic-Aligned Matching Code
2021 ICCV SMCA-DETR Fast Convergence of DETR with Spatially Modulated Co-Attention Code
2021 BMVC ACT-DETR End-to-End Object Detection with Adaptive Clustering Transformer Code
2021 ICCV Dynamic DETR Dynamic DETR: End-to-End Object Detection with Dynamic Attention N/A
2022 ICLR Sparse DETR Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity Code
2023 CVPR FastInst FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation Code

Spatial-Temporal Cross Attention Design

Year Venue Acronym Paper Title Code/Project
2021 CVPR VisTR VisTR: End-to-End Video Instance Segmentation with Transformers Code
2021 NeurIPS IFC Video Instance Segmentation using Inter-Frame Communication Transformers Code
2022 CVPR SlotVPS Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation N/A
2022 CVPR TubeFormer-DeepLab TubeFormer-DeepLab: Video Mask Transformer N/A
2022 CVPR Video K-Net Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation Code
2022 CVPR TeViT Temporally Efficient Vision Transformer for Video Instance Segmentation Code
2022 ECCV SeqFormer SeqFormer: Sequential Transformer for Video Instance Segmentation Code
2022 arXiv Mask2Former-VIS Mask2Former for Video Instance Segmentation Code
2022 PAMI TransVOD TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers Code
2022 NeurIPS VITA VITA: Video Instance Segmentation via Object Token Association Code

Optimizing Object Query

Adding Position Information into Query

Year Venue Acronym Paper Title Code/Project
2021 ICCV Conditional-DETR Conditional DETR for Fast Training Convergence Code
2022 arXiv Conditional-DETR-v2 Conditional DETR V2: Efficient Detection Transformer with Box Queries Code
2022 AAAI Anchor DETR Anchor DETR: Query Design for Transformer-Based Detector Code
2022 ICLR DAB-DETR DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR Code
2021 arXiv Efficient DETR Efficient DETR: Improving End-to-End Object Detector with Dense Prior N/A

Adding Extra Supervision into Query

Year Venue Acronym Paper Title Code/Project
2022 ECCV DE-DETR Towards Data-Efficient Detection Transformers Code
2022 CVPR DN-DETR DN-DETR: Accelerate DETR Training by Introducing Query Denoising Code
2023 ICLR DINO DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection Code
2023 CVPR MP-Former MP-Former: Mask-Piloted Transformer for Image Segmentation Code
2023 CVPR Mask-DINO Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation Code
2022 NeurIPS N/A Learning Equivariant Segmentation with Instance-Unique Querying Code
2023 CVPR H-DETR DETRs with Hybrid Matching Code
2022 arXiv Group-DETR Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment N/A
2022 arXiv Co-DETR DETRs with Collaborative Hybrid Assignments Training Code

Using Query For Association

Query as Instance Association

Year Venue Acronym Paper Title Code/Project
2022 CVPR TrackFormer TrackFormer: Multi-Object Tracking with Transformer Code
2021 arXiv TransTrack TransTrack: Multiple Object Tracking with Transformer Code
2022 ECCV MOTR MOTR: End-to-End Multiple-Object Tracking with TRansformer Code
2022 NeurIPS MinVIS MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training Code
2022 ECCV IDOL In Defense of Online Models for Video Instance Segmentation Code
2022 CVPR Video K-Net Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation Code
2023 CVPR GenVIS A Generalized Framework for Video Instance Segmentation Code
2023 arXiv Tube-Link Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation Code
2023 arXiv Video-kMaX Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation N/A

Query as Linking Multi-Tasks

Year Venue Acronym Paper Title Code/Project
2022 ECCV Panoptic-PartFormer Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation Code
2022 ECCV PolyphonicFormer PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation Code
2022 CVPR PanopticDepth PanopticDepth: A Unified Framework for Depth-Aware Panoptic Segmentation Code
2022 ECCV Fashionformer Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition Code
2022 ECCV InvPT InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding Code
2023 CVPR UNINEXT Universal Instance Perception as Object Discovery and Retrieval Code

Conditional Query Generation

Conditional Query Fusion on Language Features

Year Venue Acronym Paper Title Code/Project
2021 ICCV VLT Vision-Language Transformer and Query Generation for Referring Segmentation Code
2022 CVPR LAVT LAVT: Language-Aware Vision Transformer for Referring Image Segmentation Code
2022 CVPR ReSTR ReSTR: Convolution-free Referring Image Segmentation Using Transformers N/A
2022 CVPR CRIS CRIS: CLIP-Driven Referring Image Segmentation Code
2022 CVPR MTTR End-to-End Referring Video Object Segmentation with Multimodal Transformers Code
2022 CVPR LBDT Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation Code
2022 CVPR ReferFormer Language as Queries for Referring Video Object Segmentation Code

Conditional Query Fusion on Cross Image Features

Year Venue Acronym Paper Title Code/Project
2021 NeurIPS CyCTR Few-Shot Segmentation via Cycle-Consistent Transformer Code
2022 CVPR MatteFormer MatteFormer: Transformer-Based Image Matting via Prior-Tokens Code
2022 ECCV SegDeformer A Transformer-based Decoder for Semantic Segmentation with Multi-level Context Mining Code
2022 arXiv StructToken StructToken: Rethinking Semantic Segmentation with Structural Prior N/A
2022 NeurIPS MM-Former Mask Matching Transformer for Few-Shot Segmentation Code
2022 ECCV AAFormer Adaptive Agent Transformer for Few-shot Segmentation N/A
2023 arXiv ReferenceTwice Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation Code

Tuning Foundation Models

Vision Adapter

Year Venue Acronym Paper Title Code/Project
2022 CVPR CoCoOp Conditional Prompt Learning for Vision-Language Models Code
2022 ECCV Tip-Adapter Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification Code
2022 ECCV EVL Frozen CLIP Models are Efficient Video Learners Code
2023 ICLR ViT-Adapter Vision Transformer Adapter for Dense Predictions Code
2022 CVPR DenseCLIP DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting Code
2022 CVPR CLIPSeg Image Segmentation Using Text and Image Prompts Code
2023 CVPR OneFormer OneFormer: One Transformer to Rule Universal Image Segmentation Code

Open Vocabulary Learning

Year Venue Acronym Paper Title Code/Project
2021 CVPR OVR-CNN Open-Vocabulary Object Detection Using Captions Code
2022 ICLR ViLD Open-vocabulary Object Detection via Vision and Language Knowledge Distillation Code
2022 ECCV Detic Detecting Twenty-thousand Classes using Image-level Supervision Code
2022 ECCV OV-DETR Open-Vocabulary DETR with Conditional Matching Code
2023 ICLR F-VLM F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models Code
2022 ECCV MViT Class-agnostic Object Detection with Multi-modal Transformer Code
2022 ECCV OpenSeg Scaling Open-Vocabulary Image Segmentation with Image-Level Labels Code
2022 ICLR LSeg Language-driven Semantic Segmentation Code
2022 ECCV SimSeg A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model Code
2022 ECCV DenseCLIP Extract Free Dense Labels from CLIP Code
2021 ICCV UVO Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation Project
2023 arXiv CGG Betrayed-by-Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation Code
2022 TPAMI ES Open-World Entity Segmentation Code
2022 CVPR OW-DETR OW-DETR: Open-world Detection Transformer Code
2023 CVPR PROB PROB: Probabilistic Objectness for Open World Object Detection Code

Related Domains and Beyond

Point Cloud Segmentation

Year Venue Acronym Paper Title Code/Project
2021 ICCV Point Transformer Point Transformer N/A
2021 CVM PCT PCT: Point Cloud Transformer Code
2022 CVPR Stratified Transformer Stratified Transformer for 3D Point Cloud Segmentation Code
2022 CVPR Point-BERT Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling Code
2022 ECCV Point-MAE Masked Autoencoders for Point Cloud Self-supervised Learning Code
2022 NeurIPS Point-M2AE Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training Code
2022 ICRA Mask3D Mask3D for 3D Semantic Instance Segmentation Code
2023 AAAI SPFormer Superpoint Transformer for 3D Scene Instance Segmentation Code
2023 AAAI PUPS PUPS: Point Cloud Unified Panoptic Segmentation N/A

Domain-aware Segmentation

Year Venue Acronym Paper Title Code/Project
2022 CVPR DAFormer DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation Code
2022 ECCV HRDA HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation Code
2023 CVPR MIC MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation Code
2021 ACM MM SFA Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers Code
2023 CVPR DA-DETR DA-DETR: Domain Adaptive Detection Transformer with Information Fusion N/A
2022 ECCV MTTrans MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer Code
2022 arXiv Sentence-Seg The Devil is in the Labels: Semantic Segmentation from Sentences N/A
2023 ICLR LMSeg LMSeg: Language-guided Multi-dataset Segmentation N/A
2022 CVPR UniDet Simple Multi-Dataset Detection Code
2023 CVPR Detection Hub Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding N/A
2022 CVPR WD2 Unifying Panoptic Segmentation for Autonomous Driving Data
2023 arXiv TarViS TarViS: A Unified Approach for Target-based Video Segmentation N/A

Label and Model Efficient Segmentation

Year Venue Acronym Paper Title Code/Project
2022 CVPR MCTformer Multi-class Token Transformer for Weakly Supervised Semantic Segmentation Code
2020 CVPR PCM Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation Code
2022 ECCV ViT-PCM Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation Code
2021 ICCV DINO Emerging Properties in Self-Supervised Vision Transformers Code
2021 BMVC LOST Localizing Objects with Self-Supervised Transformers and no Labels Code
2022 ICLR STEGO Unsupervised Semantic Segmentation by Distilling Feature Correspondences Code
2022 NeurIPS ReCo ReCo: Retrieve and Co-segment for Zero-shot Transfer Code
2022 arXiv MaskDistill Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation N/A
2022 CVPR FreeSOLO FreeSOLO: Learning to Segment Objects without Annotations Code
2023 CVPR CutLER Cut and Learn for Unsupervised Object Detection and Instance Segmentation Code
2022 CVPR TokenCut Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut Code
2022 ICLR MobileViT MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer Code
2023 arXiv EMO Rethinking Mobile Block for Efficient Neural Models Code
2022 CVPR TopFormer TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation Code
2023 ICLR SeaFormer SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation Code

Class Agnostic Segmentation and Tracking

Year Venue Acronym Paper Title Code/Project
2022 CVPR Transfiner Mask Transfiner for High-Quality Instance Segmentation Code
2022 ECCV VMT Video Mask Transfiner for High-Quality Video Instance Segmentation Code
2022 arXiv SimpleClick SimpleClick: Interactive Image Segmentation with Simple Vision Transformers Code
2023 ICLR PatchDCT PatchDCT: Patch Refinement for High Quality Instance Segmentation Code
2019 ICCV STM Video Object Segmentation using Space-Time Memory Networks Code
2021 NeurIPS AOT Associating Objects with Transformers for Video Object Segmentation Code
2021 NeurIPS STCN Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation Code
2022 ECCV XMem XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model Code
2022 CVPR PCVOS Per-Clip Video Object Segmentation Code
2023 CVPR N/A Look Before You Match: Instance Understanding Matters in Video Object Segmentation N/A

Medical Image Segmentation

Year Venue Acronym Paper Title Code/Project
2021 arXiv TransUNet TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation Code
2022 ECCV Workshop Swin-Unet Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation Code
2021 MICCAI TransFuse TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation Code
2022 WACV UNETR UNETR: Transformers for 3D Medical Image Segmentation Code

Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{li2023transformer,
    author={Li, Xiangtai and Ding, Henghui and Zhang, Wenwei and Yuan, Haobo and Cheng, Guangliang and Pang, Jiangmiao and Chen, Kai and Liu, Ziwei and Loy, Chen Change},
    title={Transformer-Based Visual Segmentation: A Survey},
    journal={arXiv pre-print},
    year={2023}
}

Contact

xiangtai.li@ntu.edu.sg 
lxtpku@pku.edu.cn

Related Repo For Segmentation and Detection

Attention Model Repo by Min-Hung (Steve) Chen.

Detection Transformer Repo by IDEA.