awesome-visual-attention

A curated list of visual attention modules

FLOPs in the tables below are computed for a 64x224x224 input feature map.
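
A minimal sketch of how such a measurement could be reproduced, assuming the `thop` profiler and a plain PyTorch module (the repository does not state which FLOPs counter it uses, so exact numbers may differ by counting convention):

```python
# Hypothetical measurement helper, not part of this repository.
# Assumes the thop profiler; thop reports multiply-accumulate counts,
# which are commonly quoted as FLOPs.
import torch
from thop import profile


def profile_module(module: torch.nn.Module) -> None:
    x = torch.randn(1, 64, 224, 224)  # the 64x224x224 feature map used in the tables
    flops, params = profile(module, inputs=(x,), verbose=False)
    print(f"FLOPs: {flops / 1e9:.4f} G | Params: {params / 1e6:.6f} M")
```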

Table of Contents

  • Papers
    • Efficient Vision Transformer
    • Conv + Transformer
  • Channel Domain
  • Spatial Domain
  • Mix Domain
  • Lightweight Transformer Operator

Papers

Efficient Vision Transformer

  • DeiT: "Training data-efficient image transformers & distillation through attention", ICML, 2021 (Facebook). [Paper][PyTorch]
  • ConViT: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 (Facebook). [Paper][Code]
  • ?: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 (NavInfo Europe, Netherlands). [Paper]
  • PS-ViT: "Vision Transformer with Progressive Sampling", ICCV, 2021 (CPII). [Paper]
  • HVT: "Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021 (Monash University). [Paper][PyTorch]
  • CrossViT: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 (MIT-IBM). [Paper][PyTorch]
  • ViL: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • Visformer: "Visformer: The Vision-friendly Transformer", ICCV, 2021 (Beihang University). [Paper][PyTorch]
  • MultiExitViT: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 (Aarhus University, Denmark). [Paper][Tensorflow]
  • SViTE: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
  • DGE: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 (Megvii). [Paper][PyTorch]
  • GG-Transformer: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 (JHU). [Paper][Code (in construction)]
  • DynamicViT: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
  • ResT: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 (Nanjing University). [Paper][PyTorch]
  • Adder-Transformer: "Adder Attention for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • SOFT: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 (Fudan). [Paper][PyTorch][Website]
  • IA-RED2: "IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 (MIT-IBM). [Paper][Website]
  • LocalViT: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 (ETHZ). [Paper][PyTorch]
  • CCT: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • DiversePatch: "Vision Transformers with Patch Diversification", arXiv, 2021 (UT Austin + Facebook). [Paper][PyTorch]
  • SL-ViT: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 (Aarhus University). [Paper]
  • ?: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 (Aarhus University, Denmark). [Paper]
  • ViX: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 (Indian Institute of Technology Bombay). [Paper]
  • Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • WideNet: "Go Wider Instead of Deeper", arXiv, 2021 (NUS). [Paper]
  • Armour: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 (Arm). [Paper]
  • IPE: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 (CUHK). [Paper]
  • DS-Net++: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 (Monash University). [Paper][PyTorch]
  • UFO-ViT: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 (Kakao). [Paper]
  • Token-Pooling: "Token Pooling in Visual Transformers", arXiv, 2021 (Apple). [Paper]
  • Evo-ViT: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (Tencent). [Paper][PyTorch]
  • PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
  • ShiftViT: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (Microsoft). [Paper][PyTorch]
  • EViT: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (Tencent). [Paper][PyTorch]
  • QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
  • Anti-Oversmoothing: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][Jax]
  • LVT: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (Adobe). [Paper][PyTorch]
  • A-ViT: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (NVIDIA). [Paper][Website]
  • PS-ViT: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Rev-MViT: "Reversible Vision Transformers", CVPR, 2022 (Meta). [Paper][PyTorch-1][PyTorch-2]
  • AdaViT: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (Fudan). [Paper]
  • DQS: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (Sorbonne Université, France). [Paper]
  • ATS: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Website]
  • EdgeViT: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (Samsung). [Paper][PyTorch]
  • SReT: "Sliced Recursive Transformer", ECCV, 2022 (CMU + MBZUAI). [Paper][PyTorch]
  • SiT: "Self-slimmed Vision Transformer", ECCV, 2022 (SenseTime). [Paper][PyTorch]
  • DFvT: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (Alibaba). [Paper]
  • M3ViT: "M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 (UT Austin). [Paper][PyTorch]
  • ResT-V2: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 (Nanjing University). [Paper][PyTorch]
  • DeiT-Manifold: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 (Huawei). [Paper]
  • EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 (Snap). [Paper][PyTorch]
  • GhostNetV2: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 (Huawei). [Paper][PyTorch]
  • ?: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 (Borealis AI, Canada). [Paper]
  • TerViT: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (Beihang University). [Paper]
  • MT-ViT: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (Wuhan University). [Paper]
  • ViT-P: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (Chongqing University of Technology). [Paper]
  • CF-ViT: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (Xiamen University + Tencent). [Paper][PyTorch]
  • EIT: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (Academy of Military Sciences, China). [Paper]
  • SepViT: "SepViT: Separable Vision Transformer", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • TRT-ViT: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • SuperViT: "Super Vision Transformer", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
  • EfficientViT: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", arXiv, 2022 (MIT). [Paper]
  • Tutel: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • SimA: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][PyTorch]
  • EdgeNeXt: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
  • VVT: "Vicinity Vision Transformer", arXiv, 2022 (Australian National University). [Paper][Code (in construction)]
  • SOFT: "Softmax-free Linear Transformers", arXiv, 2022 (Fudan). [Paper][PyTorch]
  • MaiT: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (Samsung). [Paper]
  • LightViT: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
  • Next-ViT: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (ByteDance). [Paper]
  • XFormer: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (Samsung). [Paper]
  • PatchDropout: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (KTH, Sweden). [Paper]
  • ClusTR: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (The University of Adelaide, Australia). [Paper]
  • DiNAT: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][PyTorch]
  • MobileViTv3: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (Micron). [Paper][PyTorch]
  • ViT-LSLA: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 (Southwest University). [Paper]
  • Castling-ViT: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", arXiv, 2022 (Meta). [Paper]
  • ViT-Ti: "RGB no more: Minimally-decoded JPEG Vision Transformers", arXiv, 2022 (UMich). [Paper]
  • Tri-Level: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 (Northeastern University). [Paper][Code (in construction)]
  • ViTCoD: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", HPCA, 2023 (Georgia Tech). [Paper]
  • ViTALiTy: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", HPCA, 2023 (Rice University). [Paper]
  • HeatViT: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", HPCA, 2023 (Northeastern University). [Paper]
  • ToMe: "Token Merging: Your ViT But Faster", ICLR, 2023 (Meta). [Paper][PyTorch]
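
A recurring idea across the works above (e.g. DynamicViT, EViT, ATS, Token Pooling, ToMe) is to shrink the token sequence between blocks. The sketch below is a generic illustration of class-token-guided pruning, not the method of any specific paper; the helper name and the `keep_ratio` parameter are invented for this example:

```python
import torch


def prune_tokens_by_cls_attention(tokens: torch.Tensor,
                                  cls_attn: torch.Tensor,
                                  keep_ratio: float = 0.7) -> torch.Tensor:
    """Generic token-pruning step, illustrative only.

    tokens:   (B, 1 + N, D) sequence with the class token first.
    cls_attn: (B, N) attention the class token assigns to the N patch tokens
              (e.g. averaged over heads in the previous block).
    Keeps the class token plus the int(keep_ratio * N) highest-scoring patches.
    """
    b, n_plus_1, d = tokens.shape
    n = n_plus_1 - 1
    k = max(1, int(keep_ratio * n))
    idx = cls_attn.topk(k, dim=1).indices                       # (B, k) kept patch indices
    patches = tokens[:, 1:, :].gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
    return torch.cat([tokens[:, :1, :], patches], dim=1)        # (B, 1 + k, D)
```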

Conv + Transformer

  • LeViT: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • CeiT: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 (SenseTime). [Paper][PyTorch (rishikksh20)]
  • Conformer: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 (CAS). [Paper][PyTorch]
  • CoaT: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 (UCSD). [Paper][PyTorch]
  • CvT: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 (Microsoft). [Paper][Code]
  • ViTc: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 (Facebook). [Paper]
  • ConTNet: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 (ByteDance). [Paper][PyTorch]
  • SPACH: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 (Microsoft). [Paper]
  • MobileViT: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (Apple). [Paper][PyTorch]
  • CMT: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Mobile-Former: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch (in construction)]
  • TinyViT: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • CETNet: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (OPPO). [Paper]
  • ParC-Net: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (Intellifusion, China). [Paper][PyTorch]
  • ?: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (MBZUAI). [Paper][PyTorch]
  • DHVT: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (USTC). [Paper][Code (in construction)]
  • iFormer: "Inception Transformer", NeurIPS, 2022 (Sea AI Lab). [Paper][PyTorch]
  • DenseDCT: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 (University of Kansas). [Paper]
  • CXV: "Convolutional Xformers for Vision", arXiv, 2022 (IIT Bombay). [Paper][PyTorch]
  • ConvMixer: "Patches Are All You Need?", arXiv, 2022 (CMU). [Paper][PyTorch]
  • MobileViTv2: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (Apple). [Paper][PyTorch]
  • UniFormer: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (SenseTime). [Paper][PyTorch]
  • EdgeFormer: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 (?). [Paper]
  • MoCoViT: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • DynamicViT: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • ConvFormer: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (National University of Defense Technology, China). [Paper]
  • Fast-ParC: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (Intellifusion, China). [Paper]
  • MetaFormer: "MetaFormer Baselines for Vision", arXiv, 2022 (Sea AI Lab). [Paper][PyTorch]
  • STM: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 (Tsinghua University). [Paper][Code (in construction) (https://github.com/OpenGVLab/STM-Evaluation)]
  • InternImage: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", arXiv, 2022 (Shanghai AI Laboratory). [Paper][Code (in construction)]
  • ParCNetV2: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 (Intellifusion, China). [Paper]
  • VAN: "Visual Attention Network", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
  • SD-MAE: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 (Hangzhou Dianzi University). [Paper][PyTorch (in construction)]
  • SATA: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (University of Kansas). [Paper][PyTorch (in construction)]
  • SparK: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 (Bytedance). [Paper][PyTorch]
  • MOAT: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 (Google). [Paper][Tensorflow]
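
Most hybrids above share one pattern: a small convolutional stem (or interleaved conv blocks) produces tokens that a standard transformer encoder then processes. The sketch below illustrates that generic pattern only; the layer sizes are arbitrary and it is not the architecture of any listed paper:

```python
import torch
import torch.nn as nn


class ConvStemTransformer(nn.Module):
    """Generic conv + transformer hybrid, illustrative only: a strided
    convolutional stem tokenizes a 224x224 image into 14x14 tokens, then a
    standard transformer encoder mixes the tokens."""
    def __init__(self, dim: int = 192, depth: int = 4, heads: int = 3):
        super().__init__()
        self.stem = nn.Sequential(                 # 224 -> 14 via four stride-2 convs
            nn.Conv2d(3, dim // 4, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 4, dim // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)                           # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)           # (B, 196, dim) token sequence
        return self.encoder(x)
```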

Channel Domain

| Paper (year) | Implementation | Key==Query | FLOPs (G) | Params (M) |
|---|---|---|---|---|
| Squeeze-and-Excitation Networks (2018) | codes/senet.py | | 3.21 | 0.000512 |
| Effective Squeeze-Excitation (2019) | codes/se.py | | 88 | 88 |
| ECA-Net (2019) | codes/ecanet.py | | 3.21 | 3e-6 |
| SKNet (2019) | codes/sknet.py | | | |
| FcaNet (2020) | codes/fcanet.py | | | |
| Triplet Attention (2020) | PyTorch Codes | | 7.88 | 0.0003 |
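
As a concrete reference for the channel-domain entries, below is a minimal Squeeze-and-Excitation block; it is a sketch written for this list and may differ in details from `codes/senet.py`:

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (Hu et al., 2018).
    Minimal sketch; reduction ratio and layer choices follow the paper's
    defaults, not necessarily codes/senet.py."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (B, C)
        w = self.fc(w).view(b, c, 1, 1)   # excitation: per-channel gates in (0, 1)
        return x * w                      # reweight channels
```

For example, `SEBlock(64)(torch.randn(1, 64, 224, 224))` applies the gate to the 64x224x224 feature map used for the numbers above.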

Spatial Domain

| Paper (year) | Implementation | Key==Query | FLOPs (G) | Params (M) |
|---|---|---|---|---|
| Non-local Neural Networks (2018) | PyTorch Codes | ✔️ | 425.49 | 0.00848 |
| SAGAN (2018) | codes/sa.py | ✔️ | 260.91 | 0.0052 |
| ISA (2019) | PyTorch Codes | ✔️ | | |
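
The spatial-domain modules attend over positions rather than channels; a minimal embedded-Gaussian non-local block in the spirit of Non-local Neural Networks (a sketch, not the linked implementation) looks like this:

```python
import torch
import torch.nn as nn


class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block (Wang et al., 2018), minimal sketch.
    Query/key/value are 1x1 convs; attention is softmax(q @ k^T) over all
    positions, so cost is quadratic in H*W (hence the large FLOPs above)."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inner = channels // reduction
        self.q = nn.Conv2d(channels, inner, 1)
        self.k = nn.Conv2d(channels, inner, 1)
        self.v = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.k(x).flatten(2)                   # (B, C', HW)
        v = self.v(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)        # (B, HW, HW) pairwise weights
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                     # residual connection
```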

Mix Domain

| Paper (year) | Implementation | Key==Query | FLOPs (G) | Params (M) |
|---|---|---|---|---|
| CBAM (2018) | codes/cbam.py | | 5.02 | 0.00068 |
| AA-Nets (2018) | PyTorch Codes | ✔️ | | |
| Split-Attention Networks (2020) | PyTorch Codes | | | |
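
Mixed-domain modules gate both channels and positions; a minimal CBAM-style sketch (channel attention followed by spatial attention, per Woo et al., 2018) is shown below. It follows the published design and is not necessarily identical to `codes/cbam.py`:

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Minimal CBAM sketch: a channel gate from pooled descriptors passed
    through a shared MLP, then a spatial gate from a 7x7 conv over
    channel-pooled maps."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: avg- and max-pooled descriptors share one MLP.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: 7x7 conv over [avg, max] maps along the channel axis.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```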

Lightweight Transformer Operator

| Paper (year) | Implementation | Key==Query | FLOPs (G) | Params (M) |
|---|---|---|---|---|
| ParC-Net (2022) | PyTorch Codes | | | |
| EdgeViTs (2022) | PyTorch Codes | | | |
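
Both operators above replace full self-attention with cheaper global mixing. As a rough sketch of one such idea, the following implements a global circular depthwise convolution similar in spirit to ParC-Net's position-aware circular convolution; the paper's positional embedding and kernel-interpolation details are omitted, so this is illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalCircularDWConv(nn.Module):
    """Illustrative sketch: a depthwise conv whose kernel spans the whole
    vertical (or horizontal) extent of the feature map, applied with circular
    padding so every position mixes global context along that axis.
    `size` must match the feature map's height (or width)."""
    def __init__(self, channels: int, size: int, vertical: bool = True):
        super().__init__()
        self.vertical = vertical
        self.channels = channels
        k = (size, 1) if vertical else (1, size)
        self.weight = nn.Parameter(torch.randn(channels, 1, *k) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Circular padding along the global dimension, then a depthwise conv
        # that keeps the spatial size unchanged.
        pad = (self.weight.shape[2] if self.vertical else self.weight.shape[3]) - 1
        if self.vertical:
            x = F.pad(x, (0, 0, 0, pad), mode="circular")
        else:
            x = F.pad(x, (0, pad, 0, 0), mode="circular")
        return F.conv2d(x, self.weight, groups=self.channels)
```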