This repo contains a comprehensive paper list of Transformer & Attention for visual recognition / foundation models, including papers, code, and related websites. (Actively kept up to date.)
If you find any overlooked papers (including your own), please add them to this document via a pull request (recommended).
- RetNet: "Retentive Network: A Successor to Transformer for Large Language Models", arXiv, 2023 (Microsoft). [Paper][Code]
- GPViT: "GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023 (University of Edinburgh, Scotland + UCSD). [Paper][Code]
- CPVT: "Conditional Positional Encodings for Vision Transformers", ICLR, 2023 (Meituan). [Paper][Code]
- LipsFormer: "LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023 (IDEA, China). [Paper][Code]
- BiFormer: "BiFormer: Vision Transformer with Bi-Level Routing Attention", CVPR, 2023 (CUHK). [Paper][Code]
- AbSViT: "Top-Down Visual Attention from Analysis by Synthesis", CVPR, 2023 (Berkeley). [Paper][Code][Website]
- DependencyViT: "Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention", CVPR, 2023 (MIT). [Paper][Code]
- ResFormer: "ResFormer: Scaling ViTs with Multi-Resolution Training", CVPR, 2023 (Fudan). [Paper][Code]
- SViT: "Vision Transformer with Super Token Sampling", CVPR, 2023 (CAS). [Paper]
- PaCa-ViT: "PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers", CVPR, 2023 (NC State). [Paper][Code]
- GC-ViT: "Global Context Vision Transformers", ICML, 2023 (NVIDIA). [Paper][Code]
- MAGNETO: "MAGNETO: A Foundation Transformer", ICML, 2023 (Microsoft). [Paper]
- CrossFormer++: "CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention", arXiv, 2023 (Zhejiang University). [Paper][Code]
- QFormer: "Vision Transformer with Quadrangle Attention", arXiv, 2023 (The University of Sydney). [Paper][Code]
- ViT-Calibrator: "ViT-Calibrator: Decision Stream Calibration for Vision Transformer", arXiv, 2023 (Zhejiang University). [Paper]
- SpectFormer: "SpectFormer: Frequency and Attention is what you need in a Vision Transformer", arXiv, 2023 (Microsoft). [Paper][Code][Website]
- UniNeXt: "UniNeXt: Exploring A Unified Architecture for Vision Recognition", arXiv, 2023 (Alibaba). [Paper]
- CageViT: "CageViT: Convolutional Activation Guided Efficient Vision Transformer", arXiv, 2023 (Southern University of Science and Technology). [Paper]
- ------: "Making Vision Transformers Truly Shift-Equivariant", arXiv, 2023 (UIUC). [Paper]
- 2-D-SSM: "2-D SSM: A General Spatial Layer for Visual Transformers", arXiv, 2023 (Tel Aviv). [Paper][Code]
- Token-Pooling: "Token Pooling in Vision Transformers for Image Classification", WACV, 2023 (Apple). [Paper]
- Tri-Level: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 (Northeastern University). [Paper][Code]
- ViTCoD: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Georgia Tech). [Paper]
- ViTALiTy: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Rice University). [Paper]
- HeatViT: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Northeastern University). [Paper]
- ToMe: "Token Merging: Your ViT But Faster", ICLR, 2023 (Meta). [Paper][Code]
- HiViT: "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 (CAS). [Paper][Code]
- STViT: "Making Vision Transformers Efficient from A Token Sparsification View", CVPR, 2023 (Alibaba). [Paper][Code]
- SparseViT: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer", CVPR, 2023 (MIT). [Paper][Website]
- Slide-Transformer: "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention", CVPR, 2023 (Tsinghua University). [Paper][Code]
- RIFormer: "RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer", CVPR, 2023 (Shanghai AI Lab). [Paper][Code][Website]
- EfficientViT: "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR, 2023 (Microsoft). [Paper][Code]
- Castling-ViT: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", CVPR, 2023 (Meta). [Paper]
- ViT-Ti: "RGB no more: Minimally-decoded JPEG Vision Transformers", CVPR, 2023 (UMich). [Paper]
- Sparsifiner: "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers", CVPR, 2023 (University of Toronto). [Paper]
- ------: "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers", CVPR, 2023 (Baidu). [Paper]
- ElasticViT: "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices", arXiv, 2023 (Microsoft). [Paper]
- SeiT: "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage", arXiv, 2023 (NAVER). [Paper][Code]
- FastViT: "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", arXiv, 2023 (Apple). [Paper]
- CloFormer: "Rethinking Local Perception in Lightweight Vision Transformer", arXiv, 2023 (CAS). [Paper]
- Quadformer: "Vision Transformers with Mixed-Resolution Tokenization", arXiv, 2023 (Tel Aviv). [Paper][Code]
- SparseFormer: "SparseFormer: Sparse Visual Recognition via Limited Latent Tokens", arXiv, 2023 (NUS). [Paper][Code]
- EMO: "Rethinking Mobile Block for Efficient Attention-based Models", arXiv, 2023 (Tencent). [Paper][Code]
- SoViT: "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", arXiv, 2023 (DeepMind). [Paper]
- FAT: "Lightweight Vision Transformer with Bidirectional Interaction", arXiv, 2023 (CAS). [Paper][Code]
- ByteFormer: "Bytes Are All You Need: Transformers Operating Directly On File Bytes", arXiv, 2023 (Apple). [Paper]
- ------: "Muti-Scale And Token Mergence: Make Your ViT More Efficient", arXiv, 2023 (Jilin University). [Paper]
- FasterViT: "FasterViT: Fast Vision Transformers with Hierarchical Attention", arXiv, 2023 (NVIDIA). [Paper]
- NextViT: "Vision Transformer with Attention Map Hallucination and FFN Compaction", arXiv, 2023 (Baidu). [Paper]
- SkipAt: "Skip-Attention: Improving Vision Transformers by Paying Less Attention", arXiv, 2023 (Qualcomm). [Paper]
- SATA: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (University of Kansas). [Paper][Code]
- SparK: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 (Bytedance). [Paper][Code]
- MOAT: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 (Google). [Paper][Tensorflow]
- InternImage: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", CVPR, 2023 (Shanghai AI Laboratory). [Paper][Code]
- PSLT: "PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift", TPAMI, 2023 (Sun Yat-sen University). [Paper][Website]
- SwiftFormer: "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications", arXiv, 2023 (MBZUAI). [Paper][Code]
- model-soup: "Revisiting adapters with adversarial training", ICLR, 2023 (DeepMind). [Paper]
- ------: "Budgeted Training for Vision Transformer", ICLR, 2023 (Tsinghua). [Paper]
- RobustCNN: "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 (UC Santa Cruz + JHU). [Paper][Code]
- DMAE: "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 (Peking). [Paper][Code]
- TGR: "Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization", CVPR, 2023 (CUHK). [Paper][Code]
- TrojViT: "TrojViT: Trojan Insertion in Vision Transformers", CVPR, 2023 (Indiana University Bloomington). [Paper]
- RSPC: "Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions", CVPR, 2023 (MPI). [Paper]
- TORA-ViT: "Trade-off between Robustness and Accuracy of Vision Transformers", CVPR, 2023 (The University of Sydney). [Paper]
- BadViT: "You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?", CVPR, 2023 (Huazhong University of Science and Technology). [Paper]
- ------: "Understanding and Defending Patched-based Adversarial Attacks for Vision Transformer", ICML, 2023 (University of Pittsburgh). [Paper]
- PreLayerNorm: "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding", PR, 2023 (POSTECH). [Paper]
- CertViT: "CertViT: Certified Robustness of Pre-Trained Vision Transformers", arXiv, 2023 (INRIA). [Paper][Code]
- CleanCLIP: "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning", arXiv, 2023 (UCLA). [Paper]
- RoCLIP: "Robust Contrastive Language-Image Pretraining against Adversarial Attacks", arXiv, 2023 (UCLA). [Paper]
- DeepMIM: "DeepMIM: Deep Supervision for Masked Image Modeling", arXiv, 2023 (Microsoft). [Paper][Code]
- TAP-ADL: "Robustifying Token Attention for Vision Transformers", arXiv, 2023 (MPI). [Paper]
- SLaK: "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 (UT Austin). [Paper][Code]
- ConvNeXt-V2: "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR, 2023 (Meta). [Paper][Code]
- DFFormer: "FFT-based Dynamic Token Mixer for Vision", arXiv, 2023 (Rikkyo University, Japan). [Paper][Code]
- CoC: "Image as Set of Points", ICLR, 2023 (Northeastern). [Paper][Code]
- HAT-Net: "Vision Transformers with Hierarchical Attention", arXiv, 2022 (ETHZ). [Paper][Code]
- ACmix: "On the Integration of Self-Attention and Convolution", CVPR, 2022 (Tsinghua). [Paper][Code]
- Scaled-ReLU: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (Alibaba). [Paper]
- LIT: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (Monash University). [Paper][Code]
- DTN: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (Tencent). [Paper][Code]
- RegionViT: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (MIT-IBM Watson). [Paper][Code]
- CrossFormer: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (Zhejiang University). [Paper][Code]
- ------: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (UT Austin). [Paper]
- ViT-G: "Scaling Vision Transformers", CVPR, 2022 (Google). [Paper]
- CSWin: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (Microsoft). [Paper][Code]
- MPViT: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (KAIST). [Paper][Code]
- Diverse-ViT: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (UT Austin). [Paper][Code]
- DW-ViT: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (Dark Matter AI, China). [Paper][Code]
- MixFormer: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (Baidu). [Paper][Paddle]
- DAT: "Vision Transformer with Deformable Attention", CVPR, 2022 (Tsinghua). [Paper][Code]
- Swin-Transformer-V2: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (Microsoft). [Paper][Code]
- MSG-Transformer: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (Huazhong University of Science & Technology). [Paper][Code]
- NomMer: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (Tencent). [Paper][Code]
- Shunted: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (NUS). [Paper][Code]
- PyramidTNT: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (Huawei). [Paper][Code]
- X-ViT: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (Kakao). [Paper]
- ReMixer: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (KAIST). [Paper][Code]
- UN: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (Hikvision). [Paper][Code]
- Wave-ViT: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (JD). [Paper][Code]
- DaViT: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Code]
- ScalableViT: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (ByteDance). [Paper]
- MaxViT: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
- VSA: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (The University of Sydney). [Paper][Code]
- ------: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 (Microsoft). [Paper]
- Ortho: "Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022 (CAS). [Paper]
- PerViT: "Peripheral Vision Transformer", NeurIPS, 2022 (POSTECH). [Paper]
- LITv2: "Fast Vision Transformers with HiLo Attention", NeurIPS, 2022 (Monash University). [Paper][Code]
- BViT: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (CAS). [Paper]
- O-ViT: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (East China Normal University). [Paper]
- MOA-Transformer: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (University of Kansas). [Paper][Code]
- BOAT: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (Baidu + HKU). [Paper]
- ViTAEv2: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (The University of Sydney). [Paper]
- HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
- PatchMerger: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (Google). [Paper]
- DGT: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (Baidu). [Paper]
- NAT: "Neighborhood Attention Transformer", arXiv, 2022 (Oregon). [Paper][Code]
- ASF-former: "Adaptive Split-Fusion Transformer", arXiv, 2022 (Fudan). [Paper][Code]
- SP-ViT: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (Alibaba). [Paper]
- EATFormer: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (Zhejiang University). [Paper]
- LinGlo: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (TCL Research Wuhan). [Paper]
- Dual-ViT: "Dual Vision Transformer", arXiv, 2022 (JD). [Paper][Code]
- MMA: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (Centre for Research and Technology Hellas, Greece). [Paper]
- MAFormer: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (Baidu). [Paper]
- AEWin: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (Southwest Jiaotong University). [Paper]
- GrafT: "Grafting Vision Transformers", arXiv, 2022 (Stony Brook). [Paper]
- ------: "Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022 (The University of Sydney). [Paper]
- LTH-ViT: "The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022 (Northeastern University, China). [Paper]
- TT: "Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
- CabViT: "CabViT: Cross Attention among Blocks for Vision Transformer", arXiv, 2022 (Intellifusion, China). [Paper][Code]
- INTERN: "INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022 (Shanghai AI Lab). [Paper][Website]
- GGeM: "Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022 (NAVER). [Paper]
- Evo-ViT: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (Tencent). [Paper][Code]
- PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
- ShiftViT: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (Microsoft). [Paper][Code]
- EViT: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (Tencent). [Paper][Code]
- QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][Code]
- Anti-Oversmoothing: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (UT Austin). [Paper][Code]
- QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][JAX]
- LVT: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (Adobe). [Paper][Code]
- A-ViT: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (NVIDIA). [Paper][Website]
- PS-ViT: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (Huawei). [Paper]
- Rev-MViT: "Reversible Vision Transformers", CVPR, 2022 (Meta). [Paper][Code-1][Code-2]
- AdaViT: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (Fudan). [Paper]
- DQS: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (Sorbonne Université, France). [Paper]
- ATS: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Website]
- EdgeViT: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (Samsung). [Paper][Code]
- SReT: "Sliced Recursive Transformer", ECCV, 2022 (CMU + MBZUAI). [Paper][Code]
- SiT: "Self-slimmed Vision Transformer", ECCV, 2022 (SenseTime). [Paper][Code]
- DFvT: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (Alibaba). [Paper]
- M3ViT: "M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 (UT Austin). [Paper][Code]
- ResT-V2: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 (Nanjing University). [Paper][Code]
- DeiT-Manifold: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 (Huawei). [Paper]
- EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 (Snap). [Paper][Code]
- GhostNetV2: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 (Huawei). [Paper][Code]
- ------: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 (Borealis AI, Canada). [Paper]
- TerViT: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (Beihang University). [Paper]
- MT-ViT: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (Wuhan University). [Paper]
- ViT-P: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (Chongqing University of Technology). [Paper]
- CF-ViT: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (Xiamen University + Tencent). [Paper][Code]
- EIT: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (Academy of Military Sciences, China). [Paper]
- SepViT: "SepViT: Separable Vision Transformer", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
- TRT-ViT: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
- SuperViT: "Super Vision Transformer", arXiv, 2022 (Xiamen University). [Paper][Code]
- EfficientViT: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", arXiv, 2022 (MIT). [Paper]
- Tutel: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (Microsoft). [Paper][Code]
- SimA: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code]
- EdgeNeXt: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (MBZUAI). [Paper][Code]
- VVT: "Vicinity Vision Transformer", arXiv, 2022 (Australian National University). [Paper][Code]
- SOFT: "Softmax-free Linear Transformers", arXiv, 2022 (Fudan). [Paper][Code]
- MaiT: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (Samsung). [Paper]
- LightViT: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (SenseTime). [Paper][Code]
- Next-ViT: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (ByteDance). [Paper]
- XFormer: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (Samsung). [Paper]
- PatchDropout: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (KTH, Sweden). [Paper]
- ClusTR: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (The University of Adelaide, Australia). [Paper]
- DiNAT: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][Code]
- MobileViTv3: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (Micron). [Paper][Code]
- ViT-LSLA: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 (Southwest University). [Paper]
- MobileViT: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (Apple). [Paper][Code]
- CMT: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (Huawei). [Paper]
- Mobile-Former: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (Microsoft). [Paper][Code]
- TinyViT: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Code]
- CETNet: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (OPPO). [Paper]
- ParC-Net: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (Intellifusion, China). [Paper][Code]
- ------: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (MBZUAI). [Paper][Code]
- DHVT: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (USTC). [Paper][Code]
- iFormer: "Inception Transformer", NeurIPS, 2022 (Sea AI Lab). [Paper][Code]
- DenseDCT: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 (University of Kansas). [Paper]
- CXV: "Convolutional Xformers for Vision", arXiv, 2022 (IIT Bombay). [Paper][Code]
- ConvMixer: "Patches Are All You Need?", arXiv, 2022 (CMU). [Paper][Code]
- MobileViTv2: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (Apple). [Paper][Code]
- UniFormer: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (SenseTime). [Paper][Code]
- EdgeFormer: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 (?). [Paper]
- MoCoViT: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
- DynamicViT: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (Tsinghua University). [Paper][Code]
- ConvFormer: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (National University of Defense Technology, China). [Paper]
- Fast-ParC: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (Intellifusion, China). [Paper]
- MetaFormer: "MetaFormer Baselines for Vision", arXiv, 2022 (Sea AI Lab). [Paper][Code]
- STM: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 (Tsinghua University). [Paper][Code]
- ParCNetV2: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 (Intellifusion, China). [Paper]
- VAN: "Visual Attention Network", arXiv, 2022 (Tsinghua). [Paper][Code]
- SD-MAE: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 (Hangzhou Dianzi University). [Paper][Code]
- Annotations-1.3B: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (Pinterest). [Paper]
- BEiT: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (Microsoft). [Paper][Code]
- EsViT: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 (Microsoft). [Paper]
- iBOT: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 (ByteDance). [Paper][Code]
- MaskFeat: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (Facebook). [Paper]
- AutoProg: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (Monash University, Australia). [Paper][Code]
- MAE: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (Facebook). [Paper][Code][Code (pengzhiliang)]
- SimMIM: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (Microsoft). [Paper][Code]
- SelfPatch: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (KAIST). [Paper][Code]
- Bootstrapping-ViTs: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (Zhejiang University). [Paper][Code]
- TransMix: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (JHU). [Paper][Code]
- PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (Arizona State). [Paper]
- SplitMask: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (Meta). [Paper]
- MC-SSL: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (University of Surrey, UK). [Paper]
- RelViT: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (University of Padova, Italy). [Paper]
- data2vec: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 (Meta). [Paper][Code]
- SSTA: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (Tencent). [Paper][Code]
- MP3: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (Apple). [Paper][Code]
- CutMixSL: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (Yonsei University, Korea). [Paper]
- BootMAE: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (Microsoft). [Paper][Code]
- TokenMix: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (CUHK). [Paper][Code]
- ------: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (Peking University). [Paper][Code]
- HAT: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (Tsinghua). [Paper][Code]
- IDMM: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (Nanjing University). [Paper]
- AttMask: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (National Technical University of Athens). [Paper][Code]
- SLIP: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 (Berkeley + Meta). [Paper][Code]
- mc-BEiT: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 (Peking University). [Paper]
- SL2O: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 (UT Austin). [Paper][Code]
- TokenMixup: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (Korea University). [Paper][Code]
- PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 (Arizona State University). [Paper]
- GreenMIM: "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 (The University of Tokyo). [Paper][Code]
- DP-CutMix: "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 (Yonsei University). [Paper]
- ------: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (Google). [Paper][Tensorflow][Code (rwightman)]
- PeCo: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
- RePre: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
- Beyond-Masking: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (CAS). [Paper][Code]
- Kronecker-Adaptation: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
- DILEMMA: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (University of Bern, Switzerland). [Paper]
- DeiT-III: "DeiT III: Revenge of the ViT", arXiv, 2022 (Meta). [Paper]
- ------: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (Google). [Paper][Tensorflow]
- ConvMAE: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (Shanghai AI Laboratory). [Paper][Code]
- UM-MAE: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][Code]
- GMML: "GMML is All you Need", arXiv, 2022 (University of Surrey, UK). [Paper][Code]
- SIM: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (SenseTime). [Paper]
- SupMAE: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (UT Austin). [Paper][Code]
- LoMaR: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (KAUST). [Paper]
- SAR: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (University of Trento, Italy). [Paper]
- ExtreMA: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (Microsoft). [Paper]
- ------: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (Nankai University). [Paper]
- ------: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 (Southwest Jiaotong University). [Paper]
- Jigsaw-ViT: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (KU Leuven, Belgium). [Paper][Code][Website]
- BEiT-v2: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (Microsoft). [Paper][Code]
- MILAN: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (Princeton). [Paper][Code]
- PSS: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (Franklin and Marshall College, Pennsylvania). [Paper][Code]
- dBOT: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (ByteDance). [Paper]
- PatchErasing: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (Alibaba). [Paper]
- Self-Distillation: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (KAIST). [Paper]
- TL-Align: "Token-Label Alignment for Vision Transformers", arXiv, 2022 (Tsinghua University). [Paper][Code]
- AutoView: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (Sun Yat-sen University). [Paper][Code]
- CLIPpy: "Perceptual Grouping in Vision-Language Models", arXiv, 2022 (Apple). [Paper]
- LOCA: "Location-Aware Self-Supervised Transformers", arXiv, 2022 (Google). [Paper]
- FT-CLIP: "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 (Microsoft). [Paper][Code]
- MixPro: "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 (Beijing University of Chemical Technology). [Paper][Code]
- ConMIM: "Masked Image Modeling with Denoising Contrast", ICLR, 2023 (Tencent). [Paper][Code]
- ccMIM: "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 (Shanghai Jiao Tong). [Paper]
- CIM: "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (Microsoft). [Paper]
- MFM: "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (NTU, Singapore). [Paper][Website]
- Mask3D: "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors", CVPR, 2023 (Meta). [Paper]
- VisualAtom: "Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves", CVPR, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper][Code][Website]
- MixedAE: "Mixed Autoencoder for Self-supervised Visual Representation Learning", CVPR, 2023 (Huawei). [Paper]
- TBM: "Token Boosting for Robust Self-Supervised Visual Transformer Pre-training", CVPR, 2023 (Singapore University of Technology and Design). [Paper]
- LGSimCLR: "Learning Visual Representations via Language-Guided Sampling", CVPR, 2023 (UMich). [Paper][Code]
- DisCo-CLIP: "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training", CVPR, 2023 (IDEA). [Paper][Code]
- MaskCLIP: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", CVPR, 2023 (Microsoft). [Paper][Code]
- MAGE: "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis", CVPR, 2023 (Google). [Paper][Code]
- MixMIM: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", CVPR, 2023 (SenseTime). [Paper][Code]
- iTPN: "Integrally Pre-Trained Transformer Pyramid Networks", CVPR, 2023 (CAS). [Paper][Code]
- DropKey: "DropKey for Vision Transformer", CVPR, 2023 (Meitu). [Paper]
- FlexiViT: "FlexiViT: One Model for All Patch Sizes", CVPR, 2023 (Google). [Paper][Tensorflow]
- RA-CLIP: "RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training", CVPR, 2023 (Alibaba). [Paper]
- CLIPPO: "CLIPPO: Image-and-Language Understanding from Pixels Only", CVPR, 2023 (Google). [Paper][JAX]
- DMAE: "Masked Autoencoders Enable Efficient Knowledge Distillers", CVPR, 2023 (JHU + UC Santa Cruz). [Paper][Code]
- HPM: "Hard Patches Mining for Masked Image Modeling", CVPR, 2023 (CAS). [Paper][Code]
- LocalMIM: "Masked Image Modeling with Local Multi-Scale Reconstruction", CVPR, 2023 (Peking University). [Paper]
- MaskAlign: "Stare at What You See: Masked Image Modeling without Reconstruction", CVPR, 2023 (Shanghai AI Lab). [Paper][Code]
- RILS: "RILS: Masked Visual Reconstruction in Language Semantic Space", CVPR, 2023 (Tencent). [Paper][Code]
- RelaxMIM: "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature", CVPR, 2023 (Megvii). [Paper]
- FDT: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (ByteDance). [Paper][Code]
- ------: "Prefix Conditioning Unifies Language and Label Supervision", CVPR, 2023 (Google). [Paper]
- OpenCLIP: "Reproducible scaling laws for contrastive language-image learning", CVPR, 2023 (LAION). [Paper][Code]
- DiHT: "Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training", CVPR, 2023 (Meta). [Paper][Code]
- M3I-Pretraining: "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information", CVPR, 2023 (Shanghai AI Lab). [Paper][Code]
- SN-Net: "Stitchable Neural Networks", CVPR, 2023 (Monash University). [Paper][Code]
- MAE-Lite: "A Closer Look at Self-supervised Lightweight Vision Transformers", ICML, 2023 (Megvii). [Paper][Code]
- ViT-22B: "Scaling Vision Transformers to 22 Billion Parameters", ICML, 2023 (Google). [Paper]
- GHN-3: "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?", ICML, 2023 (Samsung). [Paper][Code]
- A2MIM: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", ICML, 2023 (Westlake University, China). [Paper][Code]
- PQCL: "Patch-level Contrastive Learning via Positional Query for Visual Pre-training", ICML, 2023 (Alibaba). [Paper][Code]
- CountBench: "Teaching CLIP to Count to Ten", arXiv, 2023 (Google). [Paper]
- CCViT: "Centroid-centered Modeling for Efficient Vision Transformer Pre-training", arXiv, 2023 (Wuhan University). [Paper]
- SoftCLIP: "SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger", arXiv, 2023 (Tencent). [Paper]
- MAE-WSP: "The effectiveness of MAE pre-pretraining for billion-scale pretraining", arXiv, 2023 (Meta). [Paper]
- DiffMAE: "Diffusion Models as Masked Autoencoders", arXiv, 2023 (Meta). [Paper][Website]
- RECLIP: "RECLIP: Resource-efficient CLIP by Training with Small Images", arXiv, 2023 (Google). [Paper]
- DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023 (Meta). [Paper]
- ------: "Stable and low-precision training for large-scale vision-language models", arXiv, 2023 (UW). [Paper]
- ------: "Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations", arXiv, 2023 (Meta). [Paper]
- Filter: "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness", arXiv, 2023 (Apple). [Paper]
- CLIPA: "An Inverse Scaling Law for CLIP Training", arXiv, 2023 (UC Santa Cruz). [Paper][Code]
- ------: "Improved baselines for vision-language pre-training", arXiv, 2023 (Meta). [Paper]
- 3T: "Three Towers: Flexible Contrastive Learning with Pretrained Image Models", arXiv, 2023 (Google). [Paper]
- LaCLIP: "Improving CLIP Training with Language Rewrites", arXiv, 2023 (Google). [Paper][Code]
- StableRep: "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners", arXiv, 2023 (Google). [Paper]
- ADDP: "ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process", arXiv, 2023 (CUHK + Tsinghua). [Paper]
- MOFI: "MOFI: Learning Image Representations from Noisy Entity Annotated Images", arXiv, 2023 (Apple). [Paper]
- CapPa: "Image Captioners Are Scalable Vision Learners Too", arXiv, 2023 (DeepMind). [Paper]
- MaPeT: "Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training", arXiv, 2023 (UniMoRE, Italy). [Paper][Code]
- RECO: "Retrieval-Enhanced Contrastive Vision-Text Models", arXiv, 2023 (Google). [Paper]
- DesCo: "DesCo: Learning Object Recognition with Rich Language Descriptions", arXiv, 2023 (UCLA). [Paper]
- CLIPA-v2: "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy", arXiv, 2023 (UC Santa Cruz). [Paper][Code]
- PatchMixing: "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing", arXiv, 2023 (Boston). [Paper][Website]
- SN-Netv2: "Stitched ViTs are Flexible Vision Backbones", arXiv, 2023 (Monash University). [Paper][Code]
- MIA-Former: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (Rice University). [Paper]
- Patch-Fool: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (Rice University). [Paper][Code]
- Generalization-Enhanced-ViT: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (Beihang University + NTU, Singapore). [Paper]
- ECViT: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (Tencent). [Paper]
- Attention-Fool: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (Bosch). [Paper]
- Memory-Token: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (Google). [Paper]
- APRIL: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (CAS). [Paper]
- Smooth-ViT: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (MIT). [Paper][Code]
- RVT: "Towards Robust Vision Transformer", CVPR, 2022 (Alibaba). [Paper][Code]
- Pyramid: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (Google). [Paper]
- VARS: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (Berkeley + Microsoft). [Paper][Code]
- FAN: "Understanding The Robustness in Vision Transformers", ICML, 2022 (NVIDIA). [Paper][Code]
- CFA: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (The University of Tokyo). [Paper][Code]
- ------: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (University of Exeter, UK). [Paper][Code]
- ------: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (Oxford). [Paper]
- AGAT: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (Zhejiang University). [Paper]
- ------: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (TUM). [Paper]
- ViP: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (UC Santa Cruz). [Paper][Code]
- ------: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (Peking University). [Paper][Code]
- PAR: "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 (Tianjin University). [Paper]
- RobustViT: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 (Tel-Aviv). [Paper][Code]
- ------: "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 (Google). [Paper]
- NVD: "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 (Boston). [Paper]
- ------: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (UW-Madison). [Paper]
- MA: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 (Beijing Institute of Technology). [Paper]
- ------: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (Fudan + Microsoft). [Paper]
- ------: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
- FedWAvg: "Federated Adversarial Training with Transformers", arXiv, 2022 (Institute of Electronics and Digital Technologies (IETR), France). [Paper]
- Backdoor-Transformer: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code]
- ------: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (Baidu). [Paper]
- ------: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
- ------: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (Yonsei University). [Paper]
- CLIPping Privacy: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 (TUM). [Paper]
- ------: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (EPFL). [Paper]
- ------: "Attacking Compressed Vision Transformers", arXiv, 2022 (NYU). [Paper]
- C-AVP: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (Michigan State). [Paper]
- ------: "Curved Representation Space of Vision Transformers", arXiv, 2022 (Yonsei University). [Paper]
- RKDE: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (UT Austin). [Paper]
- MRAP: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (Arizona State University). [Paper]
- CycleMLP: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (HKU). [Paper][Code]
- AS-MLP: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (ShanghaiTech University). [Paper][Code]
- Wave-MLP: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (Huawei). [Paper][Code]
- DynaMixer: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (Tencent). [Paper][Code]
- STD: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 (Huawei). [Paper]
- AMixer: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 (Tsinghua University). [Paper]
- MS-MLP: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 (Microsoft). [Paper]
- ActiveMLP: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (Microsoft). [Paper]
- MDMLP: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (Jiangsu University). [Paper][Code]
- PosMLP: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (University of Science and Technology of China). [Paper][Code]
- SplitMixer: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (Quintic AI, California). [Paper][Code]
- gSwin: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (PKSHATechnology, Japan). [Paper]
- ------: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (Berkeley). [Paper]
- DWNet: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (Nankai University). [Paper][Code]
- PoolFormer: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (Sea AI Lab). [Paper][Code]
- ConvNext: "A ConvNet for the 2020s", CVPR, 2022 (Facebook). [Paper][Code]
- RepLKNet: "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 (Megvii). [Paper][MegEngine][Code]
- FocalNet: "Focal Modulation Networks", NeurIPS, 2022 (Microsoft). [Paper][Code]
- HorNet: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 (Tsinghua). [Paper][Code][Website]
- Sequencer: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (Rikkyo University, Japan). [Paper]
- MogaNet: "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 (Westlake University, China). [Paper]
- Conv2Former: "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 (ByteDance). [Paper]