Awesome Vision Transformer Collection

Variants of Vision Transformer and Vision Transformer for Downstream Tasks

author: Runwei Guan

affiliation: University of Liverpool / JITRI-Institute of Deep Perception Technology

email: thinkerai@foxmail.com / Runwei.Guan@liverpool.ac.uk / guanrunwei@idpt.org

Image Backbone

  • Vision Transformer paper code
  • Swin Transformer paper code
  • Swin Transformer V2: Scaling Up Capacity and Resolution paper code
  • DVT paper code
  • PVT paper code
  • LVT: Lite Vision Transformer with Enhanced Self-Attention paper
  • PiT paper code
  • Twins paper code
  • TNT paper code
  • MobileViT paper code
  • CrossViT paper code
  • LeViT paper code
  • ViT-Lite paper
  • Refiner paper code
  • DeepViT paper code
  • CaiT paper code
  • LV-ViT paper code
  • DeiT paper code
  • CeiT paper code
  • BoTNet paper
  • ViTAE paper
  • Visformer: The Vision-Friendly Transformer paper code
  • Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training paper
  • AdaViT: Adaptive Tokens for Efficient Vision Transformer paper
  • Improved Multiscale Vision Transformers for Classification and Detection paper
  • Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding paper
  • Towards End-to-End Image Compression and Analysis with Transformers paper
  • MPViT: Multi-Path Vision Transformer for Dense Prediction paper
  • PolyViT: Co-training Vision Transformers on Images, Videos and Audio paper
  • MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation paper
  • ELSA: Enhanced Local Self-Attention for Vision Transformer paper
  • Vision Transformer for Small-Size Datasets paper
  • SimViT: Exploring a Simple Vision Transformer with sliding windows paper
  • SPViT: Enabling Faster Vision Transformers via Soft Token Pruning paper
  • Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space paper
  • Vision Transformer with Deformable Attention paper code
  • PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture paper
  • QuadTree Attention for Vision Transformers paper code
  • TerViT: An Efficient Ternary Vision Transformer paper
  • BViT: Broad Attention based Vision Transformer paper
  • CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction paper
  • EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers paper
  • Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention paper
  • Coarse-to-Fine Vision Transformer paper
  • ViT-P: Rethinking Data-efficient Vision Transformers from Locality paper
  • Event Transformer paper
  • DaViT: Dual Attention Vision Transformers paper
  • LightViT: Towards Light-Weight Convolution-Free Vision Transformers paper
  • UniNet: Unified Architecture Search with Convolution, Transformer, and MLP paper
  • Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning paper
  • EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications paper

Multi-label Classification

  • Graph Attention Transformer Network for Multi-Label Image Classification paper

Point Cloud Processing

  • Point Cloud Transformer paper
  • Point Transformer paper
  • Fast Point Transformer paper
  • Adaptive Channel Encoding Transformer for Point Cloud Analysis paper
  • PTTR: Relational 3D Point Cloud Object Tracking with Transformer paper
  • Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction paper
  • LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling paper
  • Geometric Transformer for Fast and Robust Point Cloud Registration paper
  • HiTPR: Hierarchical Transformer for Place Recognition in Point Cloud paper

Video Processing

  • Video Transformers: A Survey paper
  • ViViT: A Video Vision Transformer paper
  • Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos paper
  • LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach paper
  • Video Joint Modelling Based on Hierarchical Transformer for Co-summarization paper
  • InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer paper
  • TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers paper
  • UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning paper
  • Multiview Transformers for Video Recognition paper
  • MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition paper
  • Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval paper
  • A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection paper
  • Learning Trajectory-Aware Transformer for Video Super-Resolution paper
  • Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer paper

Model Compression

  • A Unified Pruning Framework for Vision Transformers paper
  • Multi-Dimensional Model Compression of Vision Transformer paper
  • Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression paper

Transfer Learning & Pretraining

  • Pre-Trained Image Processing Transformer paper code
  • UP-DETR: Unsupervised Pre-training for Object Detection with Transformers paper code
  • BEVT: BERT Pretraining of Video Transformers paper
  • Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text paper
  • On Efficient Transformer and Image Pre-training for Low-level Vision paper
  • Pre-Training Transformers for Domain Adaptation paper
  • RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training paper
  • Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification paper
  • DiT: Self-supervised Pre-training for Document Image Transformer paper
  • Underwater Image Enhancement Using Pre-trained Transformer paper

Multi-Modal

  • Multi-Modal Fusion Transformer for End-to-End Autonomous Driving paper
  • Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval paper
  • LAVT: Language-Aware Vision Transformer for Referring Image Segmentation paper
  • MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection paper
  • Visual-Semantic Transformer for Scene Text Recognition paper
  • Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text paper
  • LaTr: Layout-Aware Transformer for Scene-Text VQA paper
  • Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding paper
  • Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation paper
  • Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg paper
  • On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering paper
  • DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers paper
  • CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers paper
  • VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer paper
  • Knowledge Amalgamation for Object Detection with Transformers paper
  • Are Multimodal Transformers Robust to Missing Modality? paper
  • Self-supervised Vision Transformers for Joint SAR-optical Representation Learning paper
  • Video Graph Transformer for Video Question Answering paper

Detection

  • YOLOS: You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection paper code
  • WB-DETR: Transformer-Based Detector without Backbone paper
  • UP-DETR: Unsupervised Pre-training for Object Detection with Transformers paper
  • TSP: Rethinking Transformer-based Set Prediction for Object Detection paper
  • DETR paper code
  • Deformable DETR paper code
  • DN-DETR: Accelerate DETR Training by Introducing Query DeNoising paper code
  • End-to-End Object Detection with Adaptive Clustering Transformer paper
  • An End-to-End Transformer Model for 3D Object Detection paper
  • End-to-End Human Object Interaction Detection with HOI Transformer paper code
  • Adaptive Image Transformer for One-Shot Object Detection paper
  • Improving 3D Object Detection With Channel-Wise Transformer paper
  • TransPose: Keypoint Localization via Transformer paper
  • Voxel Transformer for 3D Object Detection paper
  • Embracing Single Stride 3D Object Detector with Sparse Transformer paper
  • OW-DETR: Open-world Detection Transformer paper
  • A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation paper
  • Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence paper
  • Short Range Correlation Transformer for Occluded Person Re-Identification paper
  • TransVPR: Transformer-based place recognition with multi-level attention aggregation paper
  • Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond paper
  • Arbitrary Shape Text Detection using Transformers paper
  • A high-precision underwater object detection based on joint self-supervised deblurring and improved spatial transformer network paper
  • A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection paper
  • Knowledge Amalgamation for Object Detection with Transformers paper
  • SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection paper
  • POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition paper
  • PSTR: End-to-End One-Step Person Search With Transformers paper
  • Scaling Novel Object Detection with Weakly Supervised Detection Transformers paper
  • OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers paper
  • Exploring Plain Vision Transformer Backbones for Object Detection paper

Segmentation

  • Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation paper code
  • Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention paper code
  • MaX-DeepLab: End-to-End Panoptic Segmentation With Mask Transformers paper code
  • Line Segment Detection Using Transformers without Edges paper
  • VisTR: End-to-End Video Instance Segmentation with Transformers paper code
  • SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers paper code
  • Segmenter: Transformer for Semantic Segmentation paper
  • Fully Transformer Networks for Semantic Image Segmentation paper
  • SOTR: Segmenting Objects with Transformers paper code
  • GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation paper
  • Masked-attention Mask Transformer for Universal Image Segmentation paper
  • A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation paper
  • iSegFormer: Interactive Image Segmentation with Transformers paper
  • SOIT: Segmenting Objects with Instance-Aware Transformers paper
  • SeMask: Semantically Masked Transformers for Semantic Segmentation paper
  • Siamese Network with Interactive Transformer for Video Object Segmentation paper
  • Pyramid Fusion Transformer for Semantic Segmentation paper
  • Swin transformers make strong contextual encoders for VHR image road extraction paper
  • Transformers in Action: Weakly Supervised Action Segmentation paper
  • Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation paper
  • Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers paper
  • Contextual Attention Network: Transformer Meets U-Net paper
  • TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation paper

Pose Estimation

  • Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation paper
  • HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation paper
  • End-to-End Human Pose and Mesh Reconstruction with Transformers paper code
  • PE-former: Pose Estimation Transformer paper
  • Pose Recognition with Cascade Transformers paper code
  • Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer code
  • Geometry-Contrastive Transformer for Generalized 3D Pose Transfer paper
  • Temporal Transformer Networks with Self-Supervision for Action Recognition paper
  • Co-training Transformer with Videos and Images Improves Action Recognition paper
  • DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer paper
  • Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition paper
  • Motion-Aware Transformer For Occluded Person Re-identification paper
  • HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders paper
  • ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers paper
  • Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding paper
  • Spatial Transformer Network on Skeleton-based Gait Recognition paper

Tracking and Trajectory Prediction

  • Transformer Tracking paper code
  • Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking paper code
  • MOTR: End-to-End Multiple-Object Tracking with TRansformer paper code
  • SwinTrack: A Simple and Strong Baseline for Transformer Tracking paper
  • Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network paper
  • PTTR: Relational 3D Point Cloud Object Tracking with Transformer paper
  • Efficient Visual Tracking with Exemplar Transformers paper
  • TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer paper

Generative Model and Denoising

  • 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds paper
  • Spatial-Temporal Transformer for Dynamic Scene Graph Generation paper
  • THUNDR: Transformer-Based 3D Human Reconstruction With Markers paper
  • DoodleFormer: Creative Sketch Drawing with Transformers paper
  • Image Transformer paper
  • Taming Transformers for High-Resolution Image Synthesis paper code
  • TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up code
  • U2-Former: A Nested U-shaped Transformer for Image Restoration paper
  • Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers paper
  • SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers paper
  • StyleSwin: Transformer-based GAN for High-resolution Image Generation paper
  • Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction paper
  • SGTR: End-to-end Scene Graph Generation with Transformer paper
  • Flow-Guided Sparse Transformer for Video Deblurring paper
  • Spherical Transformer paper
  • MaskGIT: Masked Generative Image Transformer paper
  • Entroformer: A Transformer-based Entropy Model for Learned Image Compression paper
  • UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation paper
  • Stripformer: Strip Transformer for Fast Image Deblurring paper
  • Vision Transformers for Single Image Dehazing paper
  • Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer paper

Self-Supervised Learning

  • Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning paper code
  • iGPT paper code
  • An Empirical Study of Training Self-Supervised Vision Transformers paper code
  • Self-supervised Video Transformer paper
  • TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning paper
  • TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning paper
  • Transformers in Action: Weakly Supervised Action Segmentation paper
  • Motion-Aware Transformer For Occluded Person Re-identification paper
  • Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics paper
  • Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut paper
  • Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers paper
  • Multi-class Token Transformer for Weakly Supervised Semantic Segmentation paper
  • Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers paper
  • DiT: Self-supervised Pre-training for Document Image Transformer paper
  • Self-supervised Vision Transformers for Joint SAR-optical Representation Learning paper
  • DILEMMA: Self-Supervised Shape and Texture Learning with Transformers paper

Depth and Height Estimation

  • Disentangled Latent Transformer for Interpretable Monocular Height Estimation paper
  • Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics paper
  • SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification paper

Explainable

  • Development and testing of an image transformer for explainable autonomous driving systems paper
  • Transformer Interpretability Beyond Attention Visualization paper code
  • How Do Vision Transformers Work? paper
  • eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation paper

Robustness

  • Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding paper

Deep Reinforcement Learning

  • Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels paper

Calibration

  • CTRL-C: Camera Calibration TRansformer With Line-Classification paper code

Radar

  • Learning class prototypes from Synthetic InSAR with Vision Transformers paper
  • Radar Transformer paper

Traffic

  • SwinUNet3D -- A Hierarchical Architecture for Deep Traffic Prediction using Shifted Window Transformers paper

AI Medicine

  • Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer paper
  • 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis paper
  • Hformer: Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks paper
  • MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification paper
  • MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer paper
  • Generalized Wasserstein Dice Loss, Test-time Augmentation, and Transformers for the BraTS 2021 challenge paper
  • D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation paper
  • RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark paper
  • Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images paper
  • Swin Transformer for Fast MRI paper code
  • Automatic Segmentation of Head and Neck Tumor: How Powerful Transformers Are? paper
  • ViTBIS: Vision Transformer for Biomedical Image Segmentation paper
  • SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical image segmentation paper
  • Improving Across-Dataset Brain Tissue Segmentation Using Transformer paper
  • Brain Cancer Survival Prediction on Treatment-naive MRI using Deep Anchor Attention Learning with Vision Transformer paper
  • Indication as Prior Knowledge for Multimodal Disease Classification in Chest Radiographs with Transformers paper
  • AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation paper
  • Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image Modeling Transformer for Ophthalmic Image Classification paper
  • Characterizing Renal Structures with 3D Block Aggregate Transformers paper
  • Multimodal Transformer for Nursing Activity Recognition paper
  • RTN: Reinforced Transformer Network for Coronary CT Angiography Vessel-level Image Quality Assessment paper
  • Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays paper

Hardware

  • VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer paper