Awesome-Transformer-for-Vision-Recognition

This repo contains a comprehensive paper list of Transformer & Attention for Vision Recognition / Foundation Model, including papers, code, and related websites. (Actively updated.)

If you own or find an overlooked paper, you can add it to this document by pull request (recommended).

Vision Recognition / Foundation-Model / Backbone

2023

  • RetNet: "Retentive Network: A Successor to Transformer for Large Language Models", arXiv, 2023 (Microsoft). [Paper][Code]
  • GPViT: "GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023 (University of Edinburgh, Scotland + UCSD). [Paper][Code]
  • CPVT: "Conditional Positional Encodings for Vision Transformers", ICLR, 2023 (Meituan). [Paper][Code]
  • LipsFormer: "LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023 (IDEA, China). [Paper][Code]
  • BiFormer: "BiFormer: Vision Transformer with Bi-Level Routing Attention", CVPR, 2023 (CUHK). [Paper][Code]
  • AbSViT: "Top-Down Visual Attention from Analysis by Synthesis", CVPR, 2023 (Berkeley). [Paper][Code][Website]
  • DependencyViT: "Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention", CVPR, 2023 (MIT). [Paper][Code]
  • ResFormer: "ResFormer: Scaling ViTs with Multi-Resolution Training", CVPR, 2023 (Fudan). [Paper][Code]
  • SViT: "Vision Transformer with Super Token Sampling", CVPR, 2023 (CAS). [Paper]
  • PaCa-ViT: "PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers", CVPR, 2023 (NC State). [Paper][Code]
  • GC-ViT: "Global Context Vision Transformers", ICML, 2023 (NVIDIA). [Paper][Code]
  • MAGNETO: "MAGNETO: A Foundation Transformer", ICML, 2023 (Microsoft). [Paper]
  • CrossFormer++: "CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention", arXiv, 2023 (Zhejiang University). [Paper][Code]
  • QFormer: "Vision Transformer with Quadrangle Attention", arXiv, 2023 (The University of Sydney). [Paper][Code]
  • ViT-Calibrator: "ViT-Calibrator: Decision Stream Calibration for Vision Transformer", arXiv, 2023 (Zhejiang University). [Paper]
  • SpectFormer: "SpectFormer: Frequency and Attention is what you need in a Vision Transformer", arXiv, 2023 (Microsoft). [Paper][Code][Website]
  • UniNeXt: "UniNeXt: Exploring A Unified Architecture for Vision Recognition", arXiv, 2023 (Alibaba). [Paper]
  • CageViT: "CageViT: Convolutional Activation Guided Efficient Vision Transformer", arXiv, 2023 (Southern University of Science and Technology). [Paper]
  • ------: "Making Vision Transformers Truly Shift-Equivariant", arXiv, 2023 (UIUC). [Paper]
  • 2-D-SSM: "2-D SSM: A General Spatial Layer for Visual Transformers", arXiv, 2023 (Tel Aviv). [Paper][Code]
  • Token-Pooling: "Token Pooling in Vision Transformers for Image Classification", WACV, 2023 (Apple). [Paper]
  • Tri-Level: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 (Northeastern University). [Paper][Code]
  • ViTCoD: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Georgia Tech). [Paper]
  • ViTALiTy: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Rice University). [Paper]
  • HeatViT: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Northeastern University). [Paper]
  • ToMe: "Token Merging: Your ViT But Faster", ICLR, 2023 (Meta). [Paper][Code]
  • HiViT: "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 (CAS). [Paper][Code]
  • STViT: "Making Vision Transformers Efficient from A Token Sparsification View", CVPR, 2023 (Alibaba). [Paper][Code]
  • SparseViT: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer", CVPR, 2023 (MIT). [Paper][Website]
  • Slide-Transformer: "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention", CVPR, 2023 (Tsinghua University). [Paper][Code]
  • RIFormer: "RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer", CVPR, 2023 (Shanghai AI Lab). [Paper][Code][Website]
  • EfficientViT: "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR, 2023 (Microsoft). [Paper][Code]
  • Castling-ViT: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", CVPR, 2023 (Meta). [Paper]
  • ViT-Ti: "RGB no more: Minimally-decoded JPEG Vision Transformers", CVPR, 2023 (UMich). [Paper]
  • Sparsifiner: "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers", CVPR, 2023 (University of Toronto). [Paper]
  • ------: "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers", CVPR, 2023 (Baidu). [Paper]
  • ElasticViT: "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices", arXiv, 2023 (Microsoft). [Paper]
  • SeiT: "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage", arXiv, 2023 (NAVER). [Paper][Code]
  • FastViT: "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", arXiv, 2023 (Apple). [Paper]
  • CloFormer: "Rethinking Local Perception in Lightweight Vision Transformer", arXiv, 2023 (CAS). [Paper]
  • Quadformer: "Vision Transformers with Mixed-Resolution Tokenization", arXiv, 2023 (Tel Aviv). [Paper][Code]
  • SparseFormer: "SparseFormer: Sparse Visual Recognition via Limited Latent Tokens", arXiv, 2023 (NUS). [Paper][Code]
  • EMO: "Rethinking Mobile Block for Efficient Attention-based Models", arXiv, 2023 (Tencent). [Paper][Code]
  • SoViT: "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", arXiv, 2023 (DeepMind). [Paper]
  • FAT: "Lightweight Vision Transformer with Bidirectional Interaction", arXiv, 2023 (CAS). [Paper][Code]
  • ByteFormer: "Bytes Are All You Need: Transformers Operating Directly On File Bytes", arXiv, 2023 (Apple). [Paper]
  • ------: "Muti-Scale And Token Mergence: Make Your ViT More Efficient", arXiv, 2023 (Jilin University). [Paper]
  • FasterViT: "FasterViT: Fast Vision Transformers with Hierarchical Attention", arXiv, 2023 (NVIDIA). [Paper]
  • NextViT: "Vision Transformer with Attention Map Hallucination and FFN Compaction", arXiv, 2023 (Baidu). [Paper]
  • SkipAt: "Skip-Attention: Improving Vision Transformers by Paying Less Attention", arXiv, 2023 (Qualcomm). [Paper]
  • SATA: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (University of Kansas). [Paper][Code]
  • SparK: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 (Bytedance). [Paper][Code]
  • MOAT: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 (Google). [Paper][Tensorflow]
  • InternImage: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", CVPR, 2023 (Shanghai AI Laboratory). [Paper][Code]
  • PSLT: "PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift", TPAMI, 2023 (Sun Yat-sen University). [Paper][Website]
  • SwiftFormer: "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications", arXiv, 2023 (MBZUAI). [Paper][Code]
  • model-soup: "Revisiting adapters with adversarial training", ICLR, 2023 (DeepMind). [Paper]
  • ------: "Budgeted Training for Vision Transformer", ICLR, 2023 (Tsinghua). [Paper]
  • RobustCNN: "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 (UC Santa Cruz + JHU). [Paper][Code]
  • DMAE: "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 (Peking). [Paper][Code]
  • TGR: "Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization", CVPR, 2023 (CUHK). [Paper][Code]
  • TrojViT: "TrojViT: Trojan Insertion in Vision Transformers", CVPR, 2023 (Indiana University Bloomington). [Paper]
  • RSPC: "Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions", CVPR, 2023 (MPI). [Paper]
  • TORA-ViT: "Trade-off between Robustness and Accuracy of Vision Transformers", CVPR, 2023 (The University of Sydney). [Paper]
  • BadViT: "You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?", CVPR, 2023 (Huazhong University of Science and Technology). [Paper]
  • ------: "Understanding and Defending Patched-based Adversarial Attacks for Vision Transformer", ICML, 2023 (University of Pittsburgh). [Paper]
  • PreLayerNorm: "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding", PR, 2023 (POSTECH). [Paper]
  • CertViT: "CertViT: Certified Robustness of Pre-Trained Vision Transformers", arXiv, 2023 (INRIA). [Paper][Code]
  • CleanCLIP: "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning", arXiv, 2023 (UCLA). [Paper]
  • RoCLIP: "Robust Contrastive Language-Image Pretraining against Adversarial Attacks", arXiv, 2023 (UCLA). [Paper]
  • DeepMIM: "DeepMIM: Deep Supervision for Masked Image Modeling", arXiv, 2023 (Microsoft). [Paper][Code]
  • TAP-ADL: "Robustifying Token Attention for Vision Transformers", arXiv, 2023 (MPI). [Paper]
  • SLaK: "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 (UT Austin). [Paper][Code]
  • ConvNeXt-V2: "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR, 2023 (Meta). [Paper][Code]
  • DFFormer: "FFT-based Dynamic Token Mixer for Vision", arXiv, 2023 (Rikkyo University, Japan). [Paper][Code]
  • CoC: "Image as Set of Points", ICLR, 2023 (Northeastern). [Paper][Code]

2022

  • HAT-Net: "Vision Transformers with Hierarchical Attention", arXiv, 2022 (ETHZ). [Paper][Code]

  • ACmix: "On the Integration of Self-Attention and Convolution", CVPR, 2022 (Tsinghua). [Paper][Code]

  • Scaled-ReLU: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (Alibaba). [Paper]

  • LIT: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (Monash University). [Paper][Code]

  • DTN: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (Tencent). [Paper][Code]

  • RegionViT: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (MIT-IBM Watson). [Paper][Code]

  • CrossFormer: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (Zhejiang University). [Paper][Code]

  • ------: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (UT Austin). [Paper]

  • ViT-G: "Scaling Vision Transformers", CVPR, 2022 (Google). [Paper]

  • CSWin: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (Microsoft). [Paper][Code]

  • MPViT: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (KAIST). [Paper][Code]

  • Diverse-ViT: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (UT Austin). [Paper][Code]

  • DW-ViT: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (Dark Matter AI, China). [Paper][Code]

  • MixFormer: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (Baidu). [Paper][Paddle]

  • DAT: "Vision Transformer with Deformable Attention", CVPR, 2022 (Tsinghua). [Paper][Code]

  • Swin-Transformer-V2: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (Microsoft). [Paper][Code]

  • MSG-Transformer: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (Huazhong University of Science & Technology). [Paper][Code]

  • NomMer: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (Tencent). [Paper][Code]

  • Shunted: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (NUS). [Paper][Code]

  • PyramidTNT: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (Huawei). [Paper][Code]

  • X-ViT: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (Kakao). [Paper]

  • ReMixer: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (KAIST). [Paper][Code]

  • UN: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (Hikvision). [Paper][Code]

  • Wave-ViT: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (JD). [Paper][Code]

  • DaViT: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Code]

  • ScalableViT: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (ByteDance). [Paper]

  • MaxViT: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]

  • VSA: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (The University of Sydney). [Paper][Code]

  • ------: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 (Microsoft). [Paper]

  • Ortho: "Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022 (CAS). [Paper]

  • PerViT: "Peripheral Vision Transformer", NeurIPS, 2022 (POSTECH). [Paper]

  • LITv2: "Fast Vision Transformers with HiLo Attention", NeurIPS, 2022 (Monash University). [Paper][Code]

  • BViT: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (CAS). [Paper]

  • O-ViT: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (East China Normal University). [Paper]

  • MOA-Transformer: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (University of Kansas). [Paper][Code]

  • BOAT: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (Baidu + HKU). [Paper]

  • ViTAEv2: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (The University of Sydney). [Paper]

  • HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]

  • PatchMerger: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (Google). [Paper]

  • DGT: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (Baidu). [Paper]

  • NAT: "Neighborhood Attention Transformer", arXiv, 2022 (Oregon). [Paper][Code]

  • ASF-former: "Adaptive Split-Fusion Transformer", arXiv, 2022 (Fudan). [Paper][Code]

  • SP-ViT: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (Alibaba). [Paper]

  • EATFormer: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (Zhejiang University). [Paper]

  • LinGlo: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (TCL Research Wuhan). [Paper]

  • Dual-ViT: "Dual Vision Transformer", arXiv, 2022 (JD). [Paper][Code]

  • MMA: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (Centre for Research and Technology Hellas, Greece). [Paper]

  • MAFormer: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (Baidu). [Paper]

  • AEWin: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (Southwest Jiaotong University). [Paper]

  • GrafT: "Grafting Vision Transformers", arXiv, 2022 (Stony Brook). [Paper]

  • ------: "Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022 (The University of Sydney). [Paper]

  • LTH-ViT: "The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022 (Northeastern University, China). [Paper]

  • TT: "Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022 (Hangzhou Dianzi University). [Paper]

  • CabViT: "CabViT: Cross Attention among Blocks for Vision Transformer", arXiv, 2022 (Intellifusion, China). [Paper][Code]

  • INTERN: "INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022 (Shanghai AI Lab). [Paper][Website]

  • GGeM: "Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022 (NAVER). [Paper]

  • Evo-ViT: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (Tencent). [Paper][Code]

  • PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]

  • ShiftViT: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (Microsoft). [Paper][Code]

  • EViT: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (Tencent). [Paper][Code]

  • QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][Code]

  • Anti-Oversmoothing: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (UT Austin). [Paper][Code]

  • QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][JAX]

  • LVT: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (Adobe). [Paper][Code]

  • A-ViT: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (NVIDIA). [Paper][Website]

  • PS-ViT: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (Huawei). [Paper]

  • Rev-MViT: "Reversible Vision Transformers", CVPR, 2022 (Meta). [Paper][Code-1][Code-2]

  • AdaViT: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (Fudan). [Paper]

  • DQS: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (Sorbonne Université, France). [Paper]

  • ATS: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Website]

  • EdgeViT: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (Samsung). [Paper][Code]

  • SReT: "Sliced Recursive Transformer", ECCV, 2022 (CMU + MBZUAI). [Paper][Code]

  • SiT: "Self-slimmed Vision Transformer", ECCV, 2022 (SenseTime). [Paper][Code]

  • DFvT: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (Alibaba). [Paper]

  • M3ViT: "M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 (UT Austin). [Paper][Code]

  • ResT-V2: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 (Nanjing University). [Paper][Code]

  • DeiT-Manifold: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 (Huawei). [Paper]

  • EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 (Snap). [Paper][Code]

  • GhostNetV2: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 (Huawei). [Paper][Code]

  • ------: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 (Borealis AI, Canada). [Paper]

  • TerViT: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (Beihang University). [Paper]

  • MT-ViT: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (Wuhan University). [Paper]

  • ViT-P: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (Chongqing University of Technology). [Paper]

  • CF-ViT: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (Xiamen University + Tencent). [Paper][Code]

  • EIT: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (Academy of Military Sciences, China). [Paper]

  • SepViT: "SepViT: Separable Vision Transformer", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]

  • TRT-ViT: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (ByteDance). [Paper]

  • SuperViT: "Super Vision Transformer", arXiv, 2022 (Xiamen University). [Paper][Code]

  • EfficientViT: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", arXiv, 2022 (MIT). [Paper]

  • Tutel: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (Microsoft). [Paper][Code]

  • SimA: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code]

  • EdgeNeXt: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (MBZUAI). [Paper][Code]

  • VVT: "Vicinity Vision Transformer", arXiv, 2022 (Australian National University). [Paper][Code]

  • SOFT: "Softmax-free Linear Transformers", arXiv, 2022 (Fudan). [Paper][Code]

  • MaiT: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (Samsung). [Paper]

  • LightViT: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (SenseTime). [Paper][Code]

  • Next-ViT: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (ByteDance). [Paper]

  • XFormer: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (Samsung). [Paper]

  • PatchDropout: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (KTH, Sweden). [Paper]

  • ClusTR: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (The University of Adelaide, Australia). [Paper]

  • DiNAT: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][Code]

  • MobileViTv3: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (Micron). [Paper][Code]

  • ViT-LSLA: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 (Southwest University). [Paper]

  • MobileViT: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (Apple). [Paper][Code]

  • CMT: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (Huawei). [Paper]

  • Mobile-Former: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (Microsoft). [Paper][Code]

  • TinyViT: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Code]

  • CETNet: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (OPPO). [Paper]

  • ParC-Net: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (Intellifusion, China). [Paper][Code]

  • ------: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (MBZUAI). [Paper][Code]

  • DHVT: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (USTC). [Paper][Code]

  • iFormer: "Inception Transformer", NeurIPS, 2022 (Sea AI Lab). [Paper][Code]

  • DenseDCT: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 (University of Kansas). [Paper]

  • CXV: "Convolutional Xformers for Vision", arXiv, 2022 (IIT Bombay). [Paper][Code]

  • ConvMixer: "Patches Are All You Need?", arXiv, 2022 (CMU). [Paper][Code]

  • MobileViTv2: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (Apple). [Paper][Code]

  • UniFormer: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (SenseTime). [Paper][Code]

  • EdgeFormer: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 (------). [Paper]

  • MoCoViT: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (ByteDance). [Paper]

  • DynamicViT: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (Tsinghua University). [Paper][Code]

  • ConvFormer: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (National University of Defense Technology, China). [Paper]

  • Fast-ParC: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (Intellifusion, China). [Paper]

  • MetaFormer: "MetaFormer Baselines for Vision", arXiv, 2022 (Sea AI Lab). [Paper][Code]

  • STM: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 (Tsinghua University). [Paper][Code]

  • ParCNetV2: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 (Intellifusion, China). [Paper]

  • VAN: "Visual Attention Network", arXiv, 2022 (Tsinghua). [Paper][Code]

  • SD-MAE: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 (Hangzhou Dianzi University). [Paper][Code]

  • Annotations-1.3B: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (Pinterest). [Paper]

  • BEiT: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (Microsoft). [Paper][Code]

  • EsViT: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 (Microsoft). [Paper]

  • iBOT: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 (ByteDance). [Paper][Code]

  • MaskFeat: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (Facebook). [Paper]

  • AutoProg: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (Monash University, Australia). [Paper][Code]

  • MAE: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (Facebook). [Paper][Code][Code (pengzhiliang)]

  • SimMIM: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (Microsoft). [Paper][Code]

  • SelfPatch: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (KAIST). [Paper][Code]

  • Bootstrapping-ViTs: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (Zhejiang University). [Paper][Code]

  • TransMix: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (JHU). [Paper][Code]

  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (Arizona State). [Paper]

  • SplitMask: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (Meta). [Paper]

  • MC-SSL: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (University of Surrey, UK). [Paper]

  • RelViT: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (University of Padova, Italy). [Paper]

  • data2vec: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 (Meta). [Paper][Code]

  • SSTA: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (Tencent). [Paper][Code]

  • MP3: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (Apple). [Paper][Code]

  • CutMixSL: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (Yonsei University, Korea). [Paper]

  • BootMAE: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (Microsoft). [Paper][Code]

  • TokenMix: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (CUHK). [Paper][Code]

  • ------: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (Peking University). [Paper][Code]

  • HAT: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (Tsinghua). [Paper][Code]

  • IDMM: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (Nanjing University). [Paper]

  • AttMask: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (National Technical University of Athens). [Paper][Code]

  • SLIP: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 (Berkeley + Meta). [Paper][Code]

  • mc-BEiT: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 (Peking University). [Paper]

  • SL2O: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 (UT Austin). [Paper][Code]

  • TokenMixup: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (Korea University). [Paper][Code]

  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 (Arizona State University). [Paper]

  • GreenMIM: "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 (The University of Tokyo). [Paper][Code]

  • DP-CutMix: "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 (Yonsei University). [Paper]

  • ------: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (Google). [Paper][Tensorflow][Code (rwightman)]

  • PeCo: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 (Microsoft). [Paper]

  • RePre: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]

  • Beyond-Masking: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (CAS). [Paper][Code]

  • Kronecker-Adaptation: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (Microsoft). [Paper]

  • DILEMMA: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (University of Bern, Switzerland). [Paper]

  • DeiT-III: "DeiT III: Revenge of the ViT", arXiv, 2022 (Meta). [Paper]

  • ------: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (Google). [Paper][Tensorflow]

  • ConvMAE: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (Shanghai AI Laboratory). [Paper][Code]

  • UM-MAE: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][Code]

  • GMML: "GMML is All you Need", arXiv, 2022 (University of Surrey, UK). [Paper][Code]

  • SIM: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (SenseTime). [Paper]

  • SupMAE: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (UT Austin). [Paper][Code]

  • LoMaR: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (KAUST). [Paper]

  • SAR: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (University of Trento, Italy). [Paper]

  • ExtreMA: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (Microsoft). [Paper]

  • ------: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (Nankai University). [Paper]

  • ------: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 (Southwest Jiaotong University). [Paper]

  • Jigsaw-ViT: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (KU Leuven, Belgium). [Paper][Code][Website]

  • BEiT-v2: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (Microsoft). [Paper][Code]

  • MILAN: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (Princeton). [Paper][Code]

  • PSS: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (Franklin and Marshall College, Pennsylvania). [Paper][Code]

  • dBOT: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (ByteDance). [Paper]

  • PatchErasing: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (Alibaba). [Paper]

  • Self-Distillation: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (KAIST). [Paper]

  • TL-Align: "Token-Label Alignment for Vision Transformers", arXiv, 2022 (Tsinghua University). [Paper][Code]

  • AutoView: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (Sun Yat-sen University). [Paper][Code]

  • CLIPpy: "Perceptual Grouping in Vision-Language Models", arXiv, 2022 (Apple). [Paper]

  • LOCA: "Location-Aware Self-Supervised Transformers", arXiv, 2022 (Google). [Paper]

  • FT-CLIP: "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 (Microsoft). [Paper][Code]

  • MixPro: "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 (Beijing University of Chemical Technology). [Paper][Code]

  • ConMIM: "Masked Image Modeling with Denoising Contrast", ICLR, 2023 (Tencent). [Paper][Code]

  • ccMIM: "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 (Shanghai Jiao Tong). [Paper]

  • CIM: "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (Microsoft). [Paper]

  • MFM: "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (NTU, Singapore). [Paper][Website]

  • Mask3D: "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors", CVPR, 2023 (Meta). [Paper]

  • VisualAtom: "Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves", CVPR, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper][Code][Website]

  • MixedAE: "Mixed Autoencoder for Self-supervised Visual Representation Learning", CVPR, 2023 (Huawei). [Paper]

  • TBM: "Token Boosting for Robust Self-Supervised Visual Transformer Pre-training", CVPR, 2023 (Singapore University of Technology and Design). [Paper]

  • LGSimCLR: "Learning Visual Representations via Language-Guided Sampling", CVPR, 2023 (UMich). [Paper][Code]

  • DisCo-CLIP: "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training", CVPR, 2023 (IDEA). [Paper][Code]

  • MaskCLIP: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", CVPR, 2023 (Microsoft). [Paper][Code]

  • MAGE: "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis", CVPR, 2023 (Google). [Paper][Code]

  • MixMIM: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", CVPR, 2023 (SenseTime). [Paper][Code]

  • iTPN: "Integrally Pre-Trained Transformer Pyramid Networks", CVPR, 2023 (CAS). [Paper][Code]

  • DropKey: "DropKey for Vision Transformer", CVPR, 2023 (Meitu). [Paper]

  • FlexiViT: "FlexiViT: One Model for All Patch Sizes", CVPR, 2023 (Google). [Paper][Tensorflow]

  • RA-CLIP: "RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training", CVPR, 2023 (Alibaba). [Paper]

  • CLIPPO: "CLIPPO: Image-and-Language Understanding from Pixels Only", CVPR, 2023 (Google). [Paper][JAX]

  • DMAE: "Masked Autoencoders Enable Efficient Knowledge Distillers", CVPR, 2023 (JHU + UC Santa Cruz). [Paper][Code]

  • HPM: "Hard Patches Mining for Masked Image Modeling", CVPR, 2023 (CAS). [Paper][Code]

  • LocalMIM: "Masked Image Modeling with Local Multi-Scale Reconstruction", CVPR, 2023 (Peking University). [Paper]

  • MaskAlign: "Stare at What You See: Masked Image Modeling without Reconstruction", CVPR, 2023 (Shanghai AI Lab). [Paper][Code]

  • RILS: "RILS: Masked Visual Reconstruction in Language Semantic Space", CVPR, 2023 (Tencent). [Paper][Code]

  • RelaxMIM: "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature", CVPR, 2023 (Megvii). [Paper]

  • FDT: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (ByteDance). [Paper][Code]

  • ------: "Prefix Conditioning Unifies Language and Label Supervision", CVPR, 2023 (Google). [Paper]

  • OpenCLIP: "Reproducible scaling laws for contrastive language-image learning", CVPR, 2023 (LAION). [Paper][Code]

  • DiHT: "Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training", CVPR, 2023 (Meta). [Paper][Code]

  • M3I-Pretraining: "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information", CVPR, 2023 (Shanghai AI Lab). [Paper][Code]

  • SN-Net: "Stitchable Neural Networks", CVPR, 2023 (Monash University). [Paper][Code]

  • MAE-Lite: "A Closer Look at Self-supervised Lightweight Vision Transformers", ICML, 2023 (Megvii). [Paper][Code]

  • ViT-22B: "Scaling Vision Transformers to 22 Billion Parameters", ICML, 2023 (Google). [Paper]

  • GHN-3: "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?", ICML, 2023 (Samsung). [Paper][Code]

  • A2MIM: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", ICML, 2023 (Westlake University, China). [Paper][Code]

  • PQCL: "Patch-level Contrastive Learning via Positional Query for Visual Pre-training", ICML, 2023 (Alibaba). [Paper][Code]

  • CountBench: "Teaching CLIP to Count to Ten", arXiv, 2023 (Google). [Paper]

  • CCViT: "Centroid-centered Modeling for Efficient Vision Transformer Pre-training", arXiv, 2023 (Wuhan University). [Paper]

  • SoftCLIP: "SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger", arXiv, 2023 (Tencent). [Paper]

  • MAE-WSP: "The effectiveness of MAE pre-pretraining for billion-scale pretraining", arXiv, 2023 (Meta). [Paper]

  • DiffMAE: "Diffusion Models as Masked Autoencoders", arXiv, 2023 (Meta). [Paper][Website]

  • RECLIP: "RECLIP: Resource-efficient CLIP by Training with Small Images", arXiv, 2023 (Google). [Paper]

  • DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023 (Meta). [Paper]

  • ------: "Stable and low-precision training for large-scale vision-language models", arXiv, 2023 (UW). [Paper]

  • ------: "Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations", arXiv, 2023 (Meta). [Paper]

  • Filter: "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness", arXiv, 2023 (Apple). [Paper]

  • CLIPA: "An Inverse Scaling Law for CLIP Training", arXiv, 2023 (UC Santa Cruz). [Paper][Code]

  • ------: "Improved baselines for vision-language pre-training", arXiv, 2023 (Meta). [Paper]

  • 3T: "Three Towers: Flexible Contrastive Learning with Pretrained Image Models", arXiv, 2023 (Google). [Paper]

  • LaCLIP: "Improving CLIP Training with Language Rewrites", arXiv, 2023 (Google). [Paper][Code]

  • StableRep: "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners", arXiv, 2023 (Google). [Paper]

  • ADDP: "ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process", arXiv, 2023 (CUHK + Tsinghua). [Paper]

  • MOFI: "MOFI: Learning Image Representations from Noisy Entity Annotated Images", arXiv, 2023 (Apple). [Paper]

  • CapPa: "Image Captioners Are Scalable Vision Learners Too", arXiv, 2023 (DeepMind). [Paper]

  • MaPeT: "Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training", arXiv, 2023 (UniMoRE, Italy). [Paper][Code]

  • RECO: "Retrieval-Enhanced Contrastive Vision-Text Models", arXiv, 2023 (Google). [Paper]

  • DesCo: "DesCo: Learning Object Recognition with Rich Language Descriptions", arXiv, 2023 (UCLA). [Paper]

  • CLIPA-v2: "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy", arXiv, 2023 (UC Santa Cruz). [Paper][Code]

  • PatchMixing: "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing", arXiv, 2023 (Boston). [Paper][Website]

  • SN-Netv2: "Stitched ViTs are Flexible Vision Backbones", arXiv, 2023 (Monash University). [Paper][Code]

  • PNA: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 (Fudan + Maryland). [Paper][Code]

  • MIA-Former: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (Rice University). [Paper]

  • Patch-Fool: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (Rice University). [Paper][Code]

  • Generalization-Enhanced-ViT: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (Beihang University + NTU, Singapore). [Paper]

  • ECViT: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (Tencent). [Paper]

  • Attention-Fool: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (Bosch). [Paper]

  • Memory-Token: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (Google). [Paper]

  • APRIL: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (CAS). [Paper]

  • Smooth-ViT: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (MIT). [Paper][Code]

  • RVT: "Towards Robust Vision Transformer", CVPR, 2022 (Alibaba). [Paper][Code]

  • Pyramid: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (Google). [Paper]

  • VARS: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (Berkeley + Microsoft). [Paper][Code]

  • FAN: "Understanding The Robustness in Vision Transformers", ICML, 2022 (NVIDIA). [Paper][Code]

  • CFA: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (The University of Tokyo). [Paper][Code]

  • ------: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (University of Exeter, UK). [Paper][Code]

  • ------: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (Oxford). [Paper]

  • AGAT: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (Zhejiang University). [Paper]

  • ------: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (TUM). [Paper]

  • ViP: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (UC Santa Cruz). [Paper][Code]

  • ------: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (Peking University). [Paper][Code]

  • PAR: "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 (Tianjin University). [Paper]

  • RobustViT: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 (Tel-Aviv). [Paper][Code]

  • ------: "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 (Google). [Paper]

  • NVD: "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 (Boston). [Paper]

  • ------: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (UW-Madison). [Paper]

  • MA: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 (Beijing Institute of Technology). [Paper]

  • ------: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (Fudan + Microsoft). [Paper]

  • ------: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]

  • FedWAvg: "Federated Adversarial Training with Transformers", arXiv, 2022 (Institute of Electronics and Digital Technologies (IETR), France). [Paper]

  • Backdoor-Transformer: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code]

  • ------: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (Baidu). [Paper]

  • ------: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]

  • ------: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (Yonsei University). [Paper]

  • CLIPping Privacy: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 (TUM). [Paper]

  • ------: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (EPFL). [Paper]

  • ------: "Attacking Compressed Vision Transformers", arXiv, 2022 (NYU). [Paper]

  • C-AVP: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (Michigan State). [Paper]

  • ------: "Curved Representation Space of Vision Transformers", arXiv, 2022 (Yonsei University). [Paper]

  • RKDE: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (UT Austin). [Paper]

  • MRAP: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (Arizona State University). [Paper]

  • CycleMLP: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (HKU). [Paper][Code]

  • AS-MLP: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (ShanghaiTech University). [Paper][Code]

  • Wave-MLP: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (Huawei). [Paper][Code]

  • DynaMixer: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (Tencent). [Paper][Code]

  • STD: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 (Huawei). [Paper]

  • AMixer: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 (Tsinghua University). [Paper]

  • MS-MLP: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 (Microsoft). [Paper]

  • ActiveMLP: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (Microsoft). [Paper]

  • MDMLP: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (Jiangsu University). [Paper][Code]

  • PosMLP: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (University of Science and Technology of China). [Paper][Code]

  • SplitMixer: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (Quintic AI, California). [Paper][Code]

  • gSwin: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (PKSHA Technology, Japan). [Paper]

  • ------: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (Berkeley). [Paper]

  • DWNet: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (Nankai University). [Paper][Code]

  • PoolFormer: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (Sea AI Lab). [Paper][Code]

  • ConvNeXt: "A ConvNet for the 2020s", CVPR, 2022 (Facebook). [Paper][Code]

  • RepLKNet: "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 (Megvii). [Paper][MegEngine][Code]

  • FocalNet: "Focal Modulation Networks", NeurIPS, 2022 (Microsoft). [Paper][Code]

  • HorNet: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 (Tsinghua). [Paper][Code][Website]

  • Sequencer: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (Rikkyo University, Japan). [Paper]

  • MogaNet: "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 (Westlake University, China). [Paper]

  • Conv2Former: "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 (ByteDance). [Paper]