Ultimate-Awesome-Transformer-Attention

This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers, codes, and related websites.
This list is maintained by Min-Hung Chen. (Actively updated)

If you find any missing papers, feel free to create pull requests, open issues, or email me.
Contributions in any form to make this list more comprehensive are welcome.

If you find this repository useful, please consider citing and ★STARing this list.
Feel free to share this list with others!

[Update: February, 2023] Added all the related papers from ICLR 2023!
[Update: December, 2022] Added attention-free papers from Networks Beyond Attention (GitHub) made by Jianwei Yang
[Update: November, 2022] Added all the related papers from NeurIPS 2022!
[Update: October, 2022] Split the 2nd half of the paper list to README_2.md
[Update: October, 2022] Added all the related papers from ECCV 2022!
[Update: September, 2022] Added the Transformer tutorial slides made by Lucas Beyer!
[Update: June, 2022] Added all the related papers from CVPR 2022!


Overview

------ (The following papers are moved to README_2.md) ------


Survey

  • "Efficiency 360: Efficient Vision Transformers", arXiv, 2023 (IBM). [Paper][GitHub]
  • "Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey", arXiv, 2023 (Indian Institute of Information Technology). [Paper]
  • "A Survey on Visual Transformer", TPAMI, 2022 (Huawei). [Paper]
  • "A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
  • "Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 (The University of Sydney). [Paper]
  • "Vision Transformers in Medical Imaging: A Review", arXiv, 2022 (Covenant University, Nigeria). [Paper]
  • "A Comprehensive Survey of Transformers for Computer Vision", arXiv, 2022 (Sejong University). [Paper]
  • "Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 (Microsoft). [Paper]
  • "Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 (Illinois Institute of Technology, Chicago). [Paper]
  • "Vision Transformers for Action Recognition: A Survey", arXiv, 2022 (Charles Sturt University, Australia). [Paper]
  • "VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 (CAS). [Paper]
  • "Transformers in Remote Sensing: A Survey", arXiv, 2022 (MBZUAI). [Paper][Github]
  • "Medical image analysis based on transformer: A Review", arXiv, 2022 (NUS, Singapore). [Paper]
  • "3D Vision with Transformers: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 (NYCU). [Paper]
  • "Transformers in Medical Imaging: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Multimodal Learning with Transformers: A Survey", arXiv, 2022 (Oxford). [Paper]
  • "Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022 (CAS). [Paper]
  • "Transformers in 3D Point Clouds: A Survey", arXiv, 2022 (University of Waterloo). [Paper]
  • "A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022 (INESC TEC and University of Porto, Portugal). [Paper]
  • "Efficient Transformers: A Survey", arXiv, 2022 (Google). [Paper]
  • "Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022 (Tsinghua). [Paper]
  • "Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 (National University of Sciences and Technology (NUST), Pakistan). [Paper]
  • "Video Transformers: A Survey", arXiv, 2022 (Universitat de Barcelona, Spain). [Paper]
  • "Transformers in Medical Image Analysis: A Review", arXiv, 2022 (Nanjing University). [Paper]
  • "Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022 (?). [Paper]
  • "Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 (Xidian University). [Paper]
  • "Image Captioning In the Transformer Age", arXiv, 2022 (Alibaba). [Paper][GitHub]
  • "Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 (Fayoum University, Egypt). [Paper]
  • "Transformers in Vision: A Survey", ACM Computing Surveys, 2021 (MBZUAI). [Paper]
  • "Survey: Transformer based Video-Language Pre-training", arXiv, 2021 (Renmin University of China). [Paper]
  • "A Survey of Transformers", arXiv, 2021 (Fudan). [Paper]
  • "A Survey of Visual Transformers", arXiv, 2021 (CAS). [Paper]
  • "Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021 (University of Kashmir, India). [Paper]

[Back to Overview]

Image Classification / Backbone

Replace Conv w/ Attention

Pure Attention

Conv-stem + Attention

  • GSA-Net: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (Google). [Paper][PyTorch (lucidrains)]
  • HaloNet: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (Google). [Paper][PyTorch (lucidrains)]
  • CoTNet: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (JD). [Paper][PyTorch]
  • HAT-Net: "Vision Transformers with Hierarchical Attention", arXiv, 2022 (ETHZ). [Paper][PyTorch (in construction)]

Conv + Attention

[Back to Overview]

Vision Transformer

General Vision Transformer

  • ViT: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021 (Google). [Paper][Tensorflow][PyTorch (lucidrains)][JAX (conceptofmind)]
  • Perceiver: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (DeepMind). [Paper][PyTorch (lucidrains)]
  • PiT: "Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021 (NAVER). [Paper][PyTorch]
  • VT: "Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021 (Facebook). [Paper][PyTorch (tahmid0007)]
  • PVT: "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • iRPE: "Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • CaiT: "Going deeper with Image Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • Swin-Transformer: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021 (Microsoft). [Paper][PyTorch][PyTorch (berniwal)]
  • T2T-ViT: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021 (Yitu). [Paper][PyTorch]
  • FFNBN: "Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021 (Microsoft). [Paper]
  • DPT: "DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021 (CAS). [Paper][PyTorch]
  • Focal: "Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021 (Microsoft). [Paper][PyTorch]
  • XCiT: "XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021 (Facebook). [Paper]
  • Twins: "Twins: Revisiting Spatial Attention Design in Vision Transformers", NeurIPS, 2021 (Meituan). [Paper][PyTorch]
  • ARM: "Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021 (Amazon). [Paper][GitHub (in construction)]
  • DVT: "Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch]
  • Aug-S: "Augmented Shortcuts for Vision Transformers", NeurIPS, 2021 (Huawei). [Paper]
  • TNT: "Transformer in Transformer", NeurIPS, 2021 (Huawei). [Paper][PyTorch][PyTorch (lucidrains)]
  • ViTAE: "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021 (The University of Sydney). [Paper][PyTorch]
  • DeepViT: "DeepViT: Towards Deeper Vision Transformer", arXiv, 2021 (NUS + ByteDance). [Paper][Code]
  • So-ViT: "So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021 (Dalian University of Technology). [Paper][PyTorch]
  • LV-ViT: "All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021 (ByteDance). [Paper][PyTorch]
  • NesT: "Aggregating Nested Transformers", arXiv, 2021 (Google). [Paper][Tensorflow]
  • KVT: "KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021 (Alibaba). [Paper]
  • Refined-ViT: "Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • Shuffle-Transformer: "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021 (Tencent). [Paper]
  • CAT: "CAT: Cross Attention in Vision Transformer", arXiv, 2021 (KuaiShou). [Paper][PyTorch]
  • V-MoE: "Scaling Vision with Sparse Mixture of Experts", arXiv, 2021 (Google). [Paper]
  • P2T: "P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021 (Nankai University). [Paper]
  • PVTv2: "PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021 (Nanjing University). [Paper][PyTorch]
  • LG-Transformer: "Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021 (IIAI, UAE). [Paper]
  • ViP: "Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021 (Oxford). [Paper]
  • Scaled-ReLU: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (Alibaba). [Paper]
  • LIT: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (Monash University). [Paper][PyTorch]
  • DTN: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (Tencent). [Paper][PyTorch (in construction)]
  • RegionViT: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (MIT-IBM Watson). [Paper][PyTorch]
  • CrossFormer: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (Zhejiang University). [Paper][PyTorch]
  • ?: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (UT Austin). [Paper]
  • ViT-G: "Scaling Vision Transformers", CVPR, 2022 (Google). [Paper]
  • CSWin: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MPViT: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Diverse-ViT: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (UT Austin). [Paper][PyTorch]
  • DW-ViT: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (Dark Matter AI, China). [Paper][PyTorch (in construction)]
  • MixFormer: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (Baidu). [Paper][Paddle]
  • DAT: "Vision Transformer with Deformable Attention", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
  • Swin-Transformer-V2: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MSG-Transformer: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (Huazhong University of Science & Technology). [Paper][PyTorch]
  • NomMer: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • Shunted: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (NUS). [Paper][PyTorch]
  • PyramidTNT: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (Huawei). [Paper][PyTorch]
  • X-ViT: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (Kakao). [Paper]
  • ReMixer: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (KAIST). [Paper][PyTorch]
  • UN: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (Hikvision). [Paper][Code (in construction)]
  • Wave-ViT: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (JD). [Paper][PyTorch]
  • DaViT: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • ScalableViT: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (ByteDance). [Paper]
  • MaxViT: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
  • VSA: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (The University of Sydney). [Paper][PyTorch]
  • ?: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 (Microsoft). [Paper]
  • Ortho: "Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022 (CAS). [Paper]
  • PerViT: "Peripheral Vision Transformer", NeurIPS, 2022 (POSTECH). [Paper]
  • LITv2: "Fast Vision Transformers with HiLo Attention", NeurIPS, 2022 (Monash University). [Paper][PyTorch]
  • BViT: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (CAS). [Paper]
  • O-ViT: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (East China Normal University). [Paper]
  • MOA-Transformer: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (University of Kansas). [Paper][PyTorch]
  • BOAT: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (Baidu + HKU). [Paper]
  • ViTAEv2: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (The University of Sydney). [Paper]
  • HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
  • PatchMerger: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (Google). [Paper]
  • DGT: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (Baidu). [Paper]
  • NAT: "Neighborhood Attention Transformer", arXiv, 2022 (Oregon). [Paper][PyTorch]
  • ASF-former: "Adaptive Split-Fusion Transformer", arXiv, 2022 (Fudan). [Paper][PyTorch (in construction)]
  • SP-ViT: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (Alibaba). [Paper]
  • EATFormer: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (Zhejiang University). [Paper]
  • GC-ViT: "Global Context Vision Transformers", arXiv, 2022 (NVIDIA). [Paper][PyTorch]
  • LinGlo: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (TCL Research Wuhan). [Paper]
  • Dual-ViT: "Dual Vision Transformer", arXiv, 2022 (JD). [Paper][PyTorch]
  • MMA: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (Centre for Research and Technology Hellas, Greece). [Paper]
  • MAFormer: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (Baidu). [Paper]
  • AEWin: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • MAGNETO: "Foundation Transformers", arXiv, 2022 (Microsoft). [Paper]
  • GrafT: "Grafting Vision Transformers", arXiv, 2022 (Stony Brook). [Paper]
  • ?: "Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022 (The University of Sydney). [Paper]
  • LTH-ViT: "The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022 (Northeastern University, China). [Paper]
  • TT: "Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
  • CabViT: "CabViT: Cross Attention among Blocks for Vision Transformer", arXiv, 2022 (Intellifusion, China). [Paper][PyTorch (in construction)]
  • SViT: "Vision Transformer with Super Token Sampling", arXiv, 2022 (CAS). [Paper]
  • ResFormer: "ResFormer: Scaling ViTs with Multi-Resolution Training", arXiv, 2022 (Fudan). [Paper]
  • INTERN: "INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022 (Shanghai AI Lab). [Paper][Website]
  • GGeM: "Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022 (NAVER). [Paper]
  • GPViT: "GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023 (University of Edinburgh, Scotland + UCSD). [Paper][PyTorch]
  • CPVT: "Conditional Positional Encodings for Vision Transformers", ICLR, 2023 (Meituan). [Paper][Code (in construction)]
  • LipsFormer: "LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023 (IDEA, China). [Paper]
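
For readers new to the area, the sketch below illustrates the basic ViT-style pipeline shared by many of the backbones above: patchify the image, add a class token and positional embeddings, run a transformer encoder, and classify from the class token. It is a minimal, hypothetical PyTorch sketch (layer sizes and class names are illustrative), not the reference implementation of any listed paper.

```python
# Minimal ViT-style classifier sketch (illustrative only, not any specific paper's code).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly project them to tokens."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv performs patchify + linear projection in one operation.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)

class TinyViT(nn.Module):
    def __init__(self, num_classes=1000, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch_embed = PatchEmbed(dim=dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.patch_embed.num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])          # classify from the [CLS] token

# model = TinyViT(); logits = model(torch.randn(1, 3, 224, 224))
```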

Efficient Vision Transformer

  • DeiT: "Training data-efficient image transformers & distillation through attention", ICML, 2021 (Facebook). [Paper][PyTorch]
  • ConViT: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 (Facebook). [Paper][Code]
  • ?: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 (NavInfo Europe, Netherlands). [Paper]
  • PS-ViT: "Vision Transformer with Progressive Sampling", ICCV, 2021 (CPII). [Paper]
  • HVT: "Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021 (Monash University). [Paper][PyTorch]
  • CrossViT: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 (MIT-IBM). [Paper][PyTorch]
  • ViL: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • Visformer: "Visformer: The Vision-friendly Transformer", ICCV, 2021 (Beihang University). [Paper][PyTorch]
  • MultiExitViT: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 (Aarhus University, Denmark). [Paper][Tensorflow]
  • SViTE: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
  • DGE: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 (Megvii). [Paper][PyTorch]
  • GG-Transformer: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 (JHU). [Paper][Code (in construction)]
  • DynamicViT: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
  • ResT: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 (Nanjing University). [Paper][PyTorch]
  • Adder-Transformer: "Adder Attention for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • SOFT: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 (Fudan). [Paper][PyTorch][Website]
  • IA-RED2: "IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 (MIT-IBM). [Paper][Website]
  • LocalViT: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 (ETHZ). [Paper][PyTorch]
  • CCT: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • DiversePatch: "Vision Transformers with Patch Diversification", arXiv, 2021 (UT Austin + Facebook). [Paper][PyTorch]
  • SL-ViT: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 (Aarhus University). [Paper]
  • ?: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 (Aarhus University, Denmark). [Paper]
  • ViX: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 (Indian Institute of Technology Bombay). [Paper]
  • Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • WideNet: "Go Wider Instead of Deeper", arXiv, 2021 (NUS). [Paper]
  • Armour: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 (Arm). [Paper]
  • IPE: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 (CUHK). [Paper]
  • DS-Net++: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 (Monash University). [Paper][PyTorch]
  • UFO-ViT: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 (Kakao). [Paper]
  • Token-Pooling: "Token Pooling in Visual Transformers", arXiv, 2021 (Apple). [Paper]
  • Evo-ViT: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (Tencent). [Paper][PyTorch]
  • PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
  • ShiftViT: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (Microsoft). [Paper][PyTorch]
  • EViT: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (Tencent). [Paper][PyTorch]
  • QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
  • Anti-Oversmoothing: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][Jax]
  • LVT: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (Adobe). [Paper][PyTorch]
  • A-ViT: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (NVIDIA). [Paper][Website]
  • PS-ViT: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Rev-MViT: "Reversible Vision Transformers", CVPR, 2022 (Meta). [Paper][PyTorch-1][PyTorch-2]
  • AdaViT: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (Fudan). [Paper]
  • DQS: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (Sorbonne Université, France). [Paper]
  • ATS: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Website]
  • EdgeViT: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (Samsung). [Paper][PyTorch]
  • SReT: "Sliced Recursive Transformer", ECCV, 2022 (CMU + MBZUAI). [Paper][PyTorch]
  • SiT: "Self-slimmed Vision Transformer", ECCV, 2022 (SenseTime). [Paper][PyTorch]
  • DFvT: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (Alibaba). [Paper]
  • M3ViT: "M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 (UT Austin). [Paper][PyTorch]
  • ResT-V2: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 (Nanjing University). [Paper][PyTorch]
  • DeiT-Manifold: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 (Huawei). [Paper]
  • EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 (Snap). [Paper][PyTorch]
  • GhostNetV2: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 (Huawei). [Paper][PyTorch]
  • ?: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 (Borealis AI, Canada). [Paper]
  • TerViT: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (Beihang University). [Paper]
  • MT-ViT: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (Wuhan University). [Paper]
  • ViT-P: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (Chongqing University of Technology). [Paper]
  • CF-ViT: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (Xiamen University + Tencent). [Paper][PyTorch]
  • EIT: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (Academy of Military Sciences, China). [Paper]
  • SepViT: "SepViT: Separable Vision Transformer", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • TRT-ViT: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • SuperViT: "Super Vision Transformer", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
  • EfficientViT: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", arXiv, 2022 (MIT). [Paper]
  • Tutel: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • SimA: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][PyTorch]
  • EdgeNeXt: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
  • VVT: "Vicinity Vision Transformer", arXiv, 2022 (Australian National University). [Paper][Code (in construction)]
  • SOFT: "Softmax-free Linear Transformers", arXiv, 2022 (Fudan). [Paper][PyTorch]
  • MaiT: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (Samsung). [Paper]
  • LightViT: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
  • Next-ViT: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (ByteDance). [Paper]
  • XFormer: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (Samsung). [Paper]
  • PatchDropout: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (KTH, Sweden). [Paper]
  • ClusTR: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (The University of Adelaide, Australia). [Paper]
  • DiNAT: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][PyTorch]
  • MobileViTv3: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (Micron). [Paper][PyTorch]
  • ViT-LSLA: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 (Southwest University). [Paper]
  • Castling-ViT: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", arXiv, 2022 (Meta). [Paper]
  • ViT-Ti: "RGB no more: Minimally-decoded JPEG Vision Transformers", arXiv, 2022 (UMich). [Paper]
  • Tri-Level: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 (Northeastern University). [Paper][Code (in construction)]
  • ViTCoD: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Georgia Tech). [Paper]
  • ViTALiTy: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Rice University). [Paper]
  • HeatViT: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (Northeastern University). [Paper]
  • ToMe: "Token Merging: Your ViT But Faster", ICLR, 2023 (Meta). [Paper][PyTorch]
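
Many of the efficiency papers above shrink the token sequence as it flows through the network, e.g. by pruning, merging, or adaptively sampling tokens. Below is a minimal, hypothetical sketch of one such step: keeping only the patch tokens most attended to by the class token. It assumes PyTorch and is not the algorithm of any specific entry.

```python
# Illustrative token-pruning step in the spirit of attention-based token selection
# (keeping the patch tokens most attended to by [CLS]); hypothetical and simplified.
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.7):
    """
    tokens:   (B, 1 + N, D)  -- [CLS] token followed by N patch tokens
    cls_attn: (B, N)         -- attention weights from [CLS] to each patch token
                                (e.g., averaged over heads in the previous block)
    Returns a shorter sequence containing [CLS] plus the top-k most attended patches.
    """
    B, N = cls_attn.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                      # (B, k) indices of kept patches
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))    # (B, k, D) for gather
    patches = tokens[:, 1:, :].gather(1, idx)                  # gather the kept patch tokens
    return torch.cat([tokens[:, :1, :], patches], dim=1)       # (B, 1 + k, D)

# Example: drop ~30% of 196 patch tokens between two encoder blocks.
x = torch.randn(2, 197, 768)
attn = torch.rand(2, 196)
print(prune_tokens(x, attn).shape)   # torch.Size([2, 138, 768])
```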

Conv + Transformer

  • LeViT: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • CeiT: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 (SenseTime). [Paper][PyTorch (rishikksh20)]
  • Conformer: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 (CAS). [Paper][PyTorch]
  • CoaT: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 (UCSD). [Paper][PyTorch]
  • CvT: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 (Microsoft). [Paper][Code]
  • ViTc: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 (Facebook). [Paper]
  • ConTNet: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 (ByteDance). [Paper][PyTorch]
  • SPACH: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 (Microsoft). [Paper]
  • MobileViT: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (Apple). [Paper][PyTorch]
  • CMT: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Mobile-Former: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch (in construction)]
  • TinyViT: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • CETNet: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (OPPO). [Paper]
  • ParC-Net: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (Intellifusion, China). [Paper][PyTorch]
  • ?: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (MBZUAI). [Paper][PyTorch]
  • DHVT: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (USTC). [Paper][Code (in construction)]
  • iFormer: "Inception Transformer", NeurIPS, 2022 (Sea AI Lab). [Paper][PyTorch]
  • DenseDCT: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 (University of Kansas). [Paper]
  • CXV: "Convolutional Xformers for Vision", arXiv, 2022 (IIT Bombay). [Paper][PyTorch]
  • ConvMixer: "Patches Are All You Need?", arXiv, 2022 (CMU). [Paper][PyTorch]
  • MobileViTv2: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (Apple). [Paper][PyTorch]
  • UniFormer: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (SenseTime). [Paper][PyTorch]
  • EdgeFormer: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 (?). [Paper]
  • MoCoViT: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • DynamicViT: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • ConvFormer: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (National University of Defense Technology, China). [Paper]
  • Fast-ParC: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (Intellifusion, China). [Paper]
  • MetaFormer: "MetaFormer Baselines for Vision", arXiv, 2022 (Sea AI Lab). [Paper][PyTorch]
  • STM: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 (Tsinghua University). [Paper][Code (in construction)]
  • InternImage: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", arXiv, 2022 (Shanghai AI Laboratory). [Paper][Code (in construction)]
  • ParCNetV2: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 (Intellifusion, China). [Paper]
  • VAN: "Visual Attention Network", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
  • SD-MAE: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 (Hangzhou Dianzi University). [Paper][PyTorch (in construction)]
  • SATA: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (University of Kansas). [Paper][PyTorch (in construction)]
  • SparK: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 (Bytedance). [Paper][PyTorch]
  • MOAT: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 (Google). [Paper][Tensorflow]
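
A recurring design in the hybrid models above is to replace the single-layer patch projection with a small convolutional stem before the transformer encoder. The following is a minimal, hypothetical PyTorch sketch of that idea; channel counts and depths are illustrative and not taken from any listed paper.

```python
# Minimal "conv stem + transformer" hybrid sketch: a small strided-conv stack replaces
# the single 16x16 patch projection in front of a standard transformer encoder.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Four stride-2 convs: 224x224x3 -> 14x14xdim, i.e. 196 tokens (like 16x16 patches)."""
    def __init__(self, dim=384):
        super().__init__()
        chans = [3, 48, 96, 192, dim]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                       # (B, 3, 224, 224)
        x = self.stem(x)                        # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, dim)

class HybridNet(nn.Module):
    def __init__(self, dim=384, depth=12, heads=6, num_classes=1000):
        super().__init__()
        self.stem = ConvStem(dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.encoder(self.stem(x))
        return self.head(tokens.mean(dim=1))    # global average pooling over tokens

# logits = HybridNet()(torch.randn(1, 3, 224, 224))
```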

Training + Transformer

  • iGPT: "Generative Pretraining From Pixels", ICML, 2020 (OpenAI). [Paper][Tensorflow]
  • MoCo-V3: "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper]
  • DINO: "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • drloc: "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 (University of Trento). [Paper][PyTorch]
  • CARE: "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 (Tencent). [Paper][PyTorch]
  • MST: "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 (SenseTime). [Paper]
  • SiT: "SiT: Self-supervised Vision Transformer", arXiv, 2021 (University of Surrey). [Paper][PyTorch]
  • MoBY: "Self-Supervised Learning with Swin Transformers", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • ?: "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 (Pune Institute of Computer Technology, India). [Paper]
  • Annotations-1.3B: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (Pinterest). [Paper]
  • BEiT: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (Microsoft). [Paper][PyTorch]
  • EsViT: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 (Microsoft). [Paper]
  • iBOT: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 (ByteDance). [Paper][PyTorch]
  • MaskFeat: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (Facebook). [Paper]
  • AutoProg: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (Monash University, Australia). [Paper][Code (in construction)]
  • MAE: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (Facebook). [Paper][PyTorch][PyTorch (pengzhiliang)]
  • SimMIM: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • SelfPatch: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Bootstrapping-ViTs: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
  • TransMix: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (JHU). [Paper][PyTorch]
  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (Arizona State). [Paper]
  • SplitMask: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (Meta). [Paper]
  • MC-SSL: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (University of Surrey, UK). [Paper]
  • RelViT: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (University of Padova, Italy). [Paper]
  • data2vec: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 (Meta). [Paper][PyTorch]
  • SSTA: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (Tencent). [Paper][Code (in construction)]
  • MP3: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (Apple). [Paper][PyTorch]
  • CutMixSL: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (Yonsei University, Korea). [Paper]
  • BootMAE: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • TokenMix: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (CUHK). [Paper][PyTorch]
  • ?: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • HAT: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
  • IDMM: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (Nanjing University). [Paper]
  • AttMask: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (National Technical University of Athens). [Paper][PyTorch]
  • SLIP: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 (Berkeley + Meta). [Paper][PyTorch]
  • mc-BEiT: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 (Peking University). [Paper]
  • SL2O: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 (UT Austin). [Paper][PyTorch]
  • TokenMixup: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (Korea University). [Paper][PyTorch]
  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 (Arizona State University). [Paper]
  • GreenMIM: "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 (The University of Tokyo). [Paper][PyTorch]
  • DP-CutMix: "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 (Yonsei University). [Paper]
  • ?: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (Google). [Paper][Tensorflow][PyTorch (rwightman)]
  • PeCo: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • RePre: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
  • Beyond-Masking: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (CAS). [Paper][Code (in construction)]
  • Kronecker-Adaptation: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • DILEMMA: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (University of Bern, Switzerland). [Paper]
  • DeiT-III: "DeiT III: Revenge of the ViT", arXiv, 2022 (Meta). [Paper]
  • ?: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (Google). [Paper][Tensorflow]
  • ConvMAE: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (Shanghai AI Laboratory). [Paper][PyTorch (in construction)]
  • UM-MAE: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • MixMIM: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
  • A2MIM: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", arXiv, 2022 (Westlake University, China). [Paper][PyTorch]
  • GMML: "GMML is All you Need", arXiv, 2022 (University of Surrey, UK). [Paper][PyTorch]
  • HiViT: "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 (CAS). [Paper]
  • ?: "A Closer Look at Self-supervised Lightweight Vision Transformers", arXiv, 2022 (Megvii). [Paper]
  • SIM: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (SenseTime). [Paper]
  • SupMAE: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (UT Austin). [Paper][PyTorch]
  • LoMaR: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (KAUST). [Paper]
  • SAR: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (University of Trento, Italy). [Paper]
  • ExtreMA: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (Microsoft). [Paper]
  • ?: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (Nankai University). [Paper]
  • ?: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • Jigsaw-ViT: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (KU Leuven, Belgium). [Paper][PyTorch][Website]
  • DropKey: "DropKey", arXiv, 2022 (Meitu). [Paper]
  • BEiT-v2: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MILAN: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (Princeton). [Paper][PyTorch (in construction)]
  • PSS: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (Franklin and Marshall College, Pennsylvania). [Paper][PyTorch]
  • MaskCLIP: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", arXiv, 2022 (Microsoft). [Paper]
  • DMAE: "Masked Autoencoders Enable Efficient Knowledge Distillers", arXiv, 2022 (JHU + UC Santa Cruz). [Paper][Code (in construction)]
  • dBOT: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (ByteDance). [Paper]
  • PatchErasing: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (Alibaba). [Paper]
  • Self-Distillation: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (KAIST). [Paper]
  • TL-Align: "Token-Label Alignment for Vision Transformers", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • AutoView: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (Sun Yat-sen University). [Paper][Code (in construction)]
  • CLIPpy: "Perceptual Grouping in Vision-Language Models", arXiv, 2022 (Apple). [Paper]
  • iTPN: "Integrally Pre-Trained Transformer Pyramid Networks", arXiv, 2022 (CAS). [Paper][PyTorch]
  • LOCA: "Location-Aware Self-Supervised Transformers", arXiv, 2022 (Google). [Paper]
  • FT-CLIP: "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • MixPro: "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 (Beijing University of Chemical Technology). [Paper]
  • ConMIM: "Masked Image Modeling with Denoising Contrast", ICLR, 2023 (Tencent). [Paper][PyTorch]
  • ccMIM: "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 (Shanghai Jiao Tong). [Paper]
  • CIM: "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (Microsoft). [Paper]
  • MFM: "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (NTU, Singapore). [Paper][Website]
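
Several of the pre-training methods above are built on masked image modeling: hide a large fraction of patch tokens and train the network to reconstruct them. The sketch below shows a simplified random-masking step in that spirit (assuming PyTorch; not the official implementation of any listed paper).

```python
# Illustrative MAE-style random masking of patch tokens (simplified sketch).
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """
    tokens: (B, N, D) patch tokens.
    Returns (visible_tokens, ids_restore, mask); only a (1 - mask_ratio) fraction of
    tokens is kept for the encoder, and the decoder later re-inserts learned mask
    tokens at the masked positions using ids_restore.
    """
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)       # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)                   # ascending: smallest scores are kept
    ids_restore = ids_shuffle.argsort(dim=1)             # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]
    visible = tokens.gather(1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask over the original ordering: 0 = kept (visible), 1 = masked.
    mask = torch.ones(B, N, device=tokens.device)
    mask[:, :len_keep] = 0
    mask = mask.gather(1, ids_restore)
    return visible, ids_restore, mask

# Example: keep 49 of 196 patch tokens for the encoder.
vis, ids_restore, mask = random_masking(torch.randn(2, 196, 768))
print(vis.shape, mask.sum(dim=1))   # torch.Size([2, 49, 768]) tensor([147., 147.])
```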

Robustness + Transformer

  • ViT-Robustness: "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 (Google). [Paper]
  • SAGA: "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 (University of Connecticut). [Paper]
  • ?: "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 (KAIST). [Paper][PyTorch]
  • ViTs-vs-CNNs: "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 (JHU + UC Santa Cruz). [Paper][PyTorch]
  • T-CNN: "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 (Facebook). [Paper]
  • Transformer-Attack: "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 (Xi'an Jiaotong). [Paper]
  • ?: "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 (University of Rennes). [Paper]
  • ?: "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 (ANU). [Paper][PyTorch]
  • ?: "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 (University of Pittsburgh). [Paper]
  • Token-Attack: "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 (New York University). [Paper]
  • ?: "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 (Google). [Paper]
  • ?: "Vision Transformers are Robust Learners", AAAI, 2022 (PyImageSearch + IBM). [Paper][Tensorflow]
  • PNA: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 (Fudan + Maryland). [Paper][PyTorch]
  • MIA-Former: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (Rice University). [Paper]
  • Patch-Fool: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (Rice University). [Paper][PyTorch]
  • Generalization-Enhanced-ViT: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (Beihang University + NTU, Singapore). [Paper]
  • ECViT: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (Tencent). [Paper]
  • Attention-Fool: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (Bosch). [Paper]
  • Memory-Token: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (Google). [Paper]
  • APRIL: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (CAS). [Paper]
  • Smooth-ViT: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (MIT). [Paper][PyTorch]
  • RVT: "Towards Robust Vision Transformer", CVPR, 2022 (Alibaba). [Paper][PyTorch]
  • Pyramid: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (Google). [Paper]
  • VARS: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (Berkeley + Microsoft). [Paper][PyTorch]
  • FAN: "Understanding The Robustness in Vision Transformers", ICML, 2022 (NVIDIA). [Paper][PyTorch]
  • CFA: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (The University of Tokyo). [Paper][PyTorch]
  • ?: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (University of Exeter, UK). [Paper][PyTorch]
  • ?: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (Oxford). [Paper]
  • AGAT: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (Zhejiang University). [Paper]
  • ?: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (TUM). [Paper]
  • ViP: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (UC Santa Cruz). [Paper][PyTorch]
  • ?: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
  • PAR: "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 (Tianjin University). [Paper]
  • RobustViT: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 (Tel-Aviv). [Paper][PyTorch]
  • ?: "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 (Google). [Paper]
  • NVD: "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 (Boston). [Paper]
  • ?: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (UW-Madison). [Paper]
  • MA: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • ?: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (Fudan + Microsoft). [Paper]
  • ?: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • FedWAvg: "Federated Adversarial Training with Transformers", arXiv, 2022 (Institute of Electronics and Digital Technologies (IETR), France). [Paper]
  • Backdoor-Transformer: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code (in construction)]
  • ?: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (Baidu). [Paper]
  • ?: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • ?: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (Yonsei University). [Paper]
  • CLIPping Privacy: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 (TUM). [Paper]
  • ?: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (EPFL). [Paper]
  • ?: "Attacking Compressed Vision Transformers", arXiv, 2022 (NYU). [Paper]
  • C-AVP: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (Michigan State). [Paper]
  • ?: "Curved Representation Space of Vision Transformers", arXiv, 2022 (Yonsei University). [Paper]
  • RKDE: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (UT Austin). [Paper]
  • MRAP: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (Arizona State University). [Paper]
  • model-soup: "Revisiting adapters with adversarial training", ICLR, 2023 (DeepMind). [Paper]
  • ?: "Budgeted Training for Vision Transformer", ICLR, 2023 (Tsinghua). [Paper]
  • RobustCNN: "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 (UC Santa Cruz + JHU). [Paper][PyTorch]
  • DMAE: "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 (Peking). [Paper][PyTorch]
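
Most robustness studies above evaluate models against gradient-based adversarial perturbations. As a reference point, here is a generic single-step FGSM probe written as a hypothetical PyTorch sketch; `vit_model`, `images`, and `labels` in the usage lines are assumed to exist, and this is not the attack of any specific entry.

```python
# Simple FGSM probe of a classifier's adversarial robustness: perturb the input by
# epsilon * sign(gradient of the loss). Generic illustration of the evaluation setup.
import torch
import torch.nn.functional as F

def fgsm(model, images, labels, epsilon=4 / 255):
    """Return adversarially perturbed images, clipped to the valid [0, 1] range."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0, 1).detach()

# Usage sketch (model and data are assumed to exist):
# adv_images = fgsm(vit_model, images, labels)
# robust_acc = (vit_model(adv_images).argmax(1) == labels).float().mean()
```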

Model Compression + Transformer

  • ViT-quant: "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • VTP: "Visual Transformer Pruning", arXiv, 2021 (Huawei). [Paper]
  • NViT: "NViT: Vision Transformer Compression and Parameter Redistribution", arXiv, 2021 (NVIDIA). [Paper]
  • MD-ViT: "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 (Princeton). [Paper]
  • FQ-ViT: "FQ-ViT: Fully Quantized Vision Transformer without Retraining", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • UVC: "Unified Visual Transformer Compression", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • MiniViT: "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • Auto-ViT-Acc: "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", International Conference on Field Programmable Logic and Applications (FPL), 2022 (Northeastern University). [Paper]
  • SPViT: "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 (Northeastern University). [Paper][PyTorch]
  • PSAQ-ViT: "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 (CAS). [Paper][PyTorch]
  • PTQ4ViT: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 (Peking University). [Paper]
  • EAPruning: "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 (Meituan). [Paper]
  • Q-ViT: "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 (Beihang University). [Paper][PyTorch]
  • SAViT: "SAViT: Structure-Aware Vision Transformer Pruning via Collaborative Optimization", NeurIPS, 2022 (Hikvision). [Paper]
  • VTC-LFC: "VTC-LFC: Vision Transformer Compression with Low-Frequency Components", NeurIPS, 2022 (Alibaba). [Paper][PyTorch]
  • Q-ViT: "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 (Megvii). [Paper]
  • VAQF: "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 (Northeastern University). [Paper]
  • VTP: "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 (UCLA). [Paper]
  • SiDT: "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 (UC Irvine). [Paper]
  • I-ViT: "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", arXiv, 2022 (CAS). [Paper]
  • PSAQ-ViT-V2: "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 (CAS). [Paper][PyTorch]
  • AS: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 (Baidu). [Paper]
  • SaiT: "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 (Samsung). [Paper]
  • oViT: "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 (IST Austria). [Paper]
  • BiViT: "BiViT: Extremely Compressed Binary Vision Transformer", arXiv, 2022 (Zhejiang University). [Paper]
  • CPT-V: "CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers", arXiv, 2022 (UT Austin). [Paper]
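
A common baseline behind the post-training quantization entries above is simple min-max scaling of weights to low-bit integers. The toy sketch below quantizes a weight tensor to int8 symmetrically; it is illustrative only and omits the calibration and activation-quantization schemes the listed papers actually study.

```python
# Toy symmetric post-training quantization of a weight tensor to int8 (illustrative).
import torch

def quantize_int8(w: torch.Tensor):
    """Return (int8 weights, scale) such that w ≈ w_int8.float() * scale."""
    scale = w.abs().max() / 127.0
    w_int8 = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return w_int8, scale

def dequantize(w_int8: torch.Tensor, scale: torch.Tensor):
    return w_int8.float() * scale

w = torch.randn(768, 768) * 0.02                  # e.g., an attention projection weight
w_q, s = quantize_int8(w)
err = (dequantize(w_q, s) - w).abs().max()
print(w_q.dtype, float(err))                      # torch.int8 and a small reconstruction error
```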

[Back to Overview]

Attention-Free

MLP-Series

  • RepMLP: "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • EAMLP: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (Tsinghua University). [Paper]
  • Forward-Only: "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 (Oxford). [Paper][PyTorch]
  • ResMLP: "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 (Facebook). [Paper]
  • ?: "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 (Tsinghua). [Paper]
  • ViP: "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • CCS: "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 (Baidu). [Paper]
  • S2-MLPv2: "S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 (Baidu). [Paper]
  • RaftMLP: "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 (Rikkyo University, Japan). [Paper][PyTorch]
  • Hire-MLP: "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 (Huawei). [Paper]
  • Sparse-MLP: "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 (NUS). [Paper]
  • ConvMLP: "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • sMLP: "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 (Microsoft). [Paper]
  • MLP-Mixer: "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 (Google). [Paper][Tensorflow][PyTorch-1 (lucidrains)][PyTorch-2 (rishikksh20)]
  • gMLP: "Pay Attention to MLPs", NeurIPS, 2021 (Google). [Paper][PyTorch (antonyvigouret)]
  • S2-MLP: "S2-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 (Baidu). [Paper]
  • CycleMLP: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (HKU). [Paper][PyTorch]
  • AS-MLP: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (ShanghaiTech University). [Paper][PyTorch]
  • Wave-MLP: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (Huawei). [Paper][PyTorch]
  • DynaMixer: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (Tencent). [Paper][PyTorch]
  • STD: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 (Huawei). [Paper]
  • AMixer: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 (Tsinghua University). [Paper]
  • MS-MLP: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 (Microsoft). [Paper]
  • ActiveMLP: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (Microsoft). [Paper]
  • MDMLP: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (Jiangsu University). [Paper][PyTorch]
  • PosMLP: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (University of Science and Technology of China). [Paper][PyTorch]
  • SplitMixer: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (Quintic AI, California). [Paper][PyTorch]
  • gSwin: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (PKSHATechnology, Japan). [Paper]
  • ?: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (Berkeley). [Paper]

Other Attention-Free

  • DWNet: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (Nankai University). [Paper][PyTorch]
  • PoolFormer: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (Sea AI Lab). [Paper][PyTorch]
  • ConvNext: "A ConvNet for the 2020s", CVPR, 2022 (Facebook). [Paper][PyTorch]
  • RepLKNet: "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 (Megvii). [Paper][MegEngine][PyTorch]
  • FocalNet: "Focal Modulation Networks", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
  • HorNet: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch][Website]
  • Sequencer: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (Rikkyo University, Japan). [Paper]
  • MogaNet: "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 (Westlake University, China). [Paper]
  • Conv2Former: "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 (ByteDance). [Paper]
  • CoC: "Image as Set of Points", ICLR, 2023 (Northeastern). [Paper][PyTorch]
  • SLaK: "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 (UT Austin). [Paper][PyTorch]
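
A common thread in these attention-free designs is replacing self-attention with a cheap spatial mixer, often a (large-kernel) depthwise convolution, while keeping the Transformer-style block layout. Below is a minimal sketch of such a block; the kernel size and widths are illustrative assumptions rather than the exact design of any paper listed here.

```python
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    def __init__(self, dim, kernel_size=7, mlp_ratio=4):
        super().__init__()
        # depthwise conv mixes spatial positions; no attention anywhere
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)                 # applied channels-last
        self.pwconv1 = nn.Linear(dim, mlp_ratio * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(mlp_ratio * dim, dim)

    def forward(self, x):                             # x: (batch, dim, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)        # (batch, H, W, dim)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)

feat = torch.randn(2, 96, 56, 56)
out = LargeKernelBlock(dim=96)(feat)                  # (2, 96, 56, 56)
```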

[Back to Overview]

Analysis for Transformer

  • Attention-CNN: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (EPFL). [Paper][PyTorch][Website]
  • Transformer-Explainability: "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 (Tel Aviv). [Paper][PyTorch]
  • ?: "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 (Princeton). [Paper]
  • ?: "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 (HKU). [Paper]
  • ?: "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 (Google). [Paper]
  • ?: "Intriguing Properties of Vision Transformers", NeurIPS, 2021 (MBZUAI). [Paper][PyTorch]
  • FoveaTer: "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 (UCSB). [Paper]
  • ?: "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 (Microsoft). [Paper]
  • ?: "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 (Google). [Paper]
  • ?: "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 (Horizon Robotic). [Paper]
  • ?: "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 (Temple University). [Paper][PyTorch]
  • FDSL: "Can Vision Transformers Learn without Natural Images?", AAAI, 2022 (AIST). [Paper][PyTorch][Website]
  • AlterNet: "How Do Vision Transformers Work?", ICLR, 2022 (Yonsei University). [Paper][PyTorch]
  • ?: "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations", ICLR, 2022 (Google). [Paper][Tensorflow]
  • ?: "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 (Stanford). [Paper]
  • ?: "Three things everyone should know about Vision Transformers", ECCV, 2022 (Meta). [Paper]
  • ?: "Vision Transformers provably learn spatial structure", NeurIPS, 2022 (Princeton). [Paper]
  • AWD-ViT: "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 (JD). [Paper]
  • ?: "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 (Quintic AI, CA). [Paper][Code]
  • MJP: "Breaking the Chain of Gradient Leakage in Vision Transformers", arXiv, 2022 (Tencent). [Paper]
  • ?: "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • ?: "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 (University of Groningen, The Netherlands). [Paper]
  • ?: "Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems", arXiv, 2022 (Technion Israel Institute Of Technology). [Paper]
  • ProtoPFormer: "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
  • ICLIP: "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 (HKUST). [Paper][Code (in construction)]
  • ?: "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 (Google). [Paper]
  • ?: "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 (Monash University). [Paper][PyTorch]
  • ViT-CX: "ViT-CX: Causal Explanation of Vision Transformers", arXiv, 2022 (HKUST). [Paper]
  • ?: "Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application", arXiv, 2022 (The Hong Kong Polytechnic University). [Paper]
  • IAV: "Explanation on Pretraining Bias of Finetuned Vision Transformer", arXiv, 2022 (KAIST). [Paper]
  • ?: "Teaching Matters: Investigating the Role of Supervision in Vision Transformers", arXiv, 2022 (Maryland). [Paper][PyTorch][Website]
  • ViT-Shapley: "Learning to Estimate Shapley Values with Vision Transformers", ICLR, 2023 (UW). [Paper][PyTorch]
  • ImageNet-X: "ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations", ICLR, 2023 (Meta). [Paper]
  • ?: "A Theoretical Understanding of Vision Transformers: Learning, Generalization, and Sample Complexity", ICLR, 2023 (Rensselaer Polytechnic Institute, NY). [Paper]
  • ?: "What Do Self-Supervised Vision Transformers Learn?", ICLR, 2023 (NAVER). [Paper]
  • ?: "When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?", ICLR, 2023 (Stanford). [Paper]
  • CLIP-Dissect: "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks", ICLR, 2023 (UCSD). [Paper]
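
Many of these analysis papers start by probing the attention maps themselves, e.g. how the [CLS] token attends to image patches. The snippet below is a minimal sketch of that kind of probe, using a plain `nn.MultiheadAttention` layer as a stand-in for a real ViT block; the token count and embedding size are assumptions for illustration.

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 197, 384)        # [CLS] + 14x14 patch tokens (assumed sizes)
attn = nn.MultiheadAttention(embed_dim=384, num_heads=6, batch_first=True)

with torch.no_grad():
    _, weights = attn(tokens, tokens, tokens,
                      need_weights=True, average_attn_weights=True)  # (1, 197, 197)

cls_to_patches = weights[0, 0, 1:].reshape(14, 14)   # attention from [CLS] to each patch
print(cls_to_patches.shape)                          # torch.Size([14, 14])
```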

[Back to Overview]

Detection

Object Detection

  • CNN-based backbone:
    • DETR: "End-to-End Object Detection with Transformers", ECCV, 2020 (Facebook). [Paper][PyTorch]
    • Deformable DETR: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 (SenseTime). [Paper][PyTorch]
    • UP-DETR: "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch]
    • SMCA: "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 (CUHK). [Paper][PyTorch]
    • Conditional-DETR: "Conditional DETR for Fast Training Convergence", ICCV, 2021 (Microsoft). [Paper]
    • PnP-DETR: "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 (Yitu). [Paper][Code (in construction)]
    • TSP: "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 (CMU). [Paper]
    • Dynamic-DETR: "Dynamic DETR: End-to-End Object Detection With Dynamic Attention", ICCV, 2021 (Microsoft). [Paper]
    • ViT-YOLO: "ViT-YOLO: Transformer-Based YOLO for Object Detection", ICCVW, 2021 (Xidian University). [Paper]
    • ACT: "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 (Peking + CUHK). [Paper][PyTorch]
    • DIL-ViT: "Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 (Monash University Malaysia). [Paper]
    • Efficient-DETR: "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 (Megvii). [Paper]
    • CA-FPN: "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 (CAS). [Paper]
    • DETReg: "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 (Tel-Aviv + Berkeley). [Paper][Website]
    • GQPos: "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 (Megvii). [Paper]
    • Anchor-DETR: "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 (Megvii). [Paper][PyTorch]
    • Sparse-DETR: "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 (Kakao). [Paper][PyTorch]
    • DAB-DETR: "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 (IDEA, China). [Paper][PyTorch]
    • DN-DETR: "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 (International Digital Economy Academy (IDEA), China). [Paper][PyTorch]
    • SAM-DETR: "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
    • AdaMixer: "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 (Nanjing University). [Paper][Code (in construction)]
    • DESTR: "DESTR: Object Detection With Split Transformer", CVPR, 2022 (Oregon State). [Paper]
    • REGO: "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 (The University of Sydney). [Paper][PyTorch]
    • ?: "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 (Ant Group). [Paper]
    • DE-DETR: "Towards Data-Efficient Detection Transformers", ECCV, 2022 (JD). [Paper][PyTorch]
    • DFFT: "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 (Tencent). [Paper]
    • Cornerformer: "Cornerformer: Purifying Instances for Corner-Based Detectors", ECCV, 2022 (Huawei). [Paper]
    • ?: "A Simple Approach and Benchmark for 21,000-Category Object Detection", ECCV, 2022 (Microsoft). [Paper][Code (in construction)]
    • Obj2Seq: "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", NeurIPS, 2022 (CAS). [Paper][PyTorch]
    • KA: "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 (Zhejiang University). [Paper]
    • MIMDet: "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", arXiv, 2022 (Tencent). [Paper][PyTorch]
    • imTED: "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", arXiv, 2022 (CAS). [Paper]
    • AO2-DETR: "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 (Peking University). [Paper]
    • MaskDINO: "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", arXiv, 2022 (IDEA, China). [Paper][Code (in construction)]
    • TCC: "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 (The University of Sydney). [Paper]
    • Conditional-DETR-V2: "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 (Peking University). [Paper]
    • Group-DETR: "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", arXiv, 2022 (Baidu). [Paper]
    • H-DETR: "DETRs with Hybrid Matching", arXiv, 2022 (Microsoft). [Paper]
    • SAM-DETR++: "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
    • IMFA: "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
    • ComplETR: "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 (Amazon). [Paper]
    • Pair-DETR: "Pair DETR: Contrastive Learning Speeds Up DETR Training", arXiv, 2022 (Amazon). [Paper]
    • SAP-DETR: "SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency", arXiv, 2022 (CAS). [Paper]
    • Group-DETR-v2: "Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining", arXiv, 2022 (Baidu). [Paper]
    • KD-DETR: "Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling", arXiv, 2022 (Baidu). [Paper]
    • D3ETR: "D3ETR: Decoder Distillation for Detection Transformer", arXiv, 2022 (Peking University). [Paper]
    • DETRDistill: "DETRDistill: A Universal Knowledge Distillation Framework for DETR-families", arXiv, 2022 (USTC). [Paper]
    • Teach-DETR: "Teach-DETR: Better Training DETR with Teachers", arXiv, 2022 (CUHK). [Paper][Code (in construction)]
    • Co-DETR: "DETRs with Collaborative Hybrid Assignments Training", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
    • ViT-Adapter: "ViT-Adapter: Exploring Plain Vision Transformer for Accurate Dense Predictions", ICLR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • Transformer-based backbone:
    • ViT-FRCNN: "Toward Transformer-Based Object Detection", arXiv, 2020 (Pinterest). [Paper]
    • WB-DETR: "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 (CAS). [Paper]
    • YOLOS: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 (Horizon Robotics). [Paper][PyTorch]
    • ?: "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 (Facebook). [Paper]
    • ViDT: "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 (NAVER). [Paper][PyTorch]
    • FP-DETR: "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 (USTC). [Paper]
    • DETR++: "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 (Google). [Paper]
    • ViTDet: "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 (Meta). [Paper]
    • UViT: "A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation", ECCV, 2022 (Google). [Paper]
    • CFDT: "A Transformer-Based Object Detector with Coarse-Fine Crossing Representations", NeurIPS, 2022 (Huawei). [Paper]
    • D2ETR: "D2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 (Alibaba). [Paper][PyTorch]
    • DINO: "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", ICLR, 2023 (IDEA, China). [Paper][PyTorch]
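
Most detectors in this section follow the DETR recipe: backbone features are fed to a Transformer together with a fixed set of learned object queries, and each query is decoded into a class label and a box. The sketch below condenses that recipe into a few lines; it omits positional encodings, Hungarian matching, and the losses, and all sizes are illustrative assumptions rather than any specific paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    def __init__(self, num_classes=91, dim=256, num_queries=100):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep conv stages only
        self.input_proj = nn.Conv2d(2048, dim, kernel_size=1)
        self.transformer = nn.Transformer(d_model=dim, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Parameter(torch.randn(num_queries, dim))  # learned object queries
        self.class_head = nn.Linear(dim, num_classes + 1)               # +1 for "no object"
        self.bbox_head = nn.Linear(dim, 4)                              # (cx, cy, w, h)

    def forward(self, images):                                  # images: (batch, 3, H, W)
        feats = self.input_proj(self.backbone(images))          # (batch, dim, h, w)
        src = feats.flatten(2).transpose(1, 2)                  # (batch, h*w, dim)
        queries = self.query_embed.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, queries)                     # (batch, num_queries, dim)
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

logits, boxes = MiniDETR()(torch.randn(1, 3, 224, 224))         # (1, 100, 92), (1, 100, 4)
```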

[Back to Overview]

3D Object Detection

  • AST-GRU: "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 (Baidu). [Paper][Code (in construction)]
  • Pointformer: "3D Object Detection with Pointformer", arXiv, 2020 (Tsinghua). [Paper]
  • CT3D: "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 (Alibaba). [Paper][Code (in construction)]
  • Group-Free-3D: "Group-Free 3D Object Detection via Transformers", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • VoTr: "Voxel Transformer for 3D Object Detection", ICCV, 2021 (CUHK + NUS). [Paper]
  • 3DETR: "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
  • DETR3D: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 (MIT). [Paper]
  • M3DETR: "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 (University of Maryland). [Paper][PyTorch]
  • SST: "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 (CAS). [Paper][PyTorch]
  • MonoDTR: "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 (NTU). [Paper][Code (in construction)]
  • VoxSeT: "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • TransFusion: "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 (HKUST). [Paper][PyTorch]
  • CAT-Det: "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 (Beihang University). [Paper]
  • TokenFusion: "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 (Tsinghua). [Paper]
  • LIFT: "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 (Shanghai Jiao Tong University). [Paper]
  • BoxeR: "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 (University of Amsterdam). [Paper][PyTorch]
  • BrT: "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 (Tsinghua). [Paper]
  • VISTA: "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
  • STRL: "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 (Bosch). [Paper]
  • MTrans: "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 (HKU). [Paper][PyTorch]
  • CenterFormer: "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 (TuSimple). [Paper][Code (in construction)]
  • BUTD-DETR: "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 (CMU). [Paper][PyTorch][Website]
  • SpatialDETR: "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 (Mercedes-Benz). [Paper][PyTorch]
  • CramNet: "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 (Waymo). [Paper]
  • SWFormer: "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", ECCV, 2022 (Waymo). [Paper]
  • EMMF-Det: "Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection", ECCV, 2022 (Hikvision). [Paper]
  • UVTR: "Unifying Voxel-based Representation with Transformer for 3D Object Detection", NeurIPS, 2022 (CUHK). [Paper][PyTorch]
  • MsSVT: "MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds", NeurIPS, 2022 (Beijing Institute of Technology). [Paper]
  • DeepInteraction: "DeepInteraction: 3D Object Detection via Modality Interaction", NeurIPS, 2022 (Fudan). [Paper][PyTorch]
  • PETR: "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 (Megvii). [Paper]
  • MonoDETR: "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", arXiv, 2022 (Shanghai AI Laboratory). [Paper][Code (in construction)]
  • Graph-DETR3D: "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 (University of Science and Technology of China). [Paper]
  • PETRv2: "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", arXiv, 2022 (Megvii). [Paper]
  • PolarFormer: "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 (Fudan University). [Paper][Code (in construction)]
  • AST-GRU: "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • SEFormer: "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 (Tsinghua University). [Paper]
  • CRAFT: "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 (KAIST). [Paper]
  • CrossDTR: "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 (NTU). [Paper][Code (in construction)]
  • ?: "3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers", arXiv, 2022 (Houmo AI, China). [Paper]
  • Focal-PETR: "Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • Li3DeTr: "Li3DeTr: A LiDAR based 3D Detection Transformer", WACV, 2023 (University of Coimbra, Portugal). [Paper]

[Back to Overview]

Multi-Modal Detection

  • OVR-CNN: "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 (Snap). [Paper][PyTorch]
  • MDETR: "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 (NYU). [Paper][PyTorch][Website]
  • FETNet: "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 (Tsinghua). [Paper]
  • MEDUSA: "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 (Google). [Paper][PyTorch]
  • StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 (Baidu). [Paper]
  • MAVL: "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • OWL-ViT: "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 (Google). [Paper][JAX][Hugging Face]
  • X-DETR: "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 (Amazon). [Paper]
  • simCrossTrans: "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 (The City University of New York). [Paper][PyTorch]
  • ?: "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 (USC). [Paper]
  • YONOD: "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 (CUNY). [Paper][PyTorch]
  • OmDet: "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 (Binjiang Institute of Zhejiang University). [Paper]
  • Detection-Hub: "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", arXiv, 2022 (Fudan + Microsoft). [Paper]
  • F-VLM: "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", arXiv, 2022 (Google). [Paper]
  • ContFormer: "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 (Peking University). [Paper]
  • DQ-DETR: "DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding", AAAI, 2023 (International Digital Economy Academy (IDEA)). [Paper][Code (in construction)]

[Back to Overview]

HOI Detection

  • HOI-Transformer: "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 (Megvii). [Paper][PyTorch]
  • HOTR: "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 (Kakao + Korea University). [Paper][PyTorch]
  • MSTR: "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 (Kakao). [Paper]
  • SSRT: "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 (Amazon). [Paper]
  • CPC: "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 (Korea University). [Paper][PyTorch (in construction)]
  • DisTR: "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 (Baidu). [Paper]
  • STIP: "Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection", CVPR, 2022 (JD). [Paper][PyTorch]
  • DOQ: "Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 (South China University of Technology). [Paper]
  • UPT: "Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer", CVPR, 2022 (Australian Centre for Robotic Vision). [Paper][PyTorch][Website]
  • CATN: "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 (Huazhong University of Science and Technology). [Paper]
  • HQM: "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 (South China University of Technology). [Paper][PyTorch]
  • Iwin: "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 (Shanghai Jiao Tong). [Paper]
  • RLIP: "RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection", NeurIPS, 2022 (Zhejiang University). [Paper][PyTorch]
  • TUTOR: "Video-based Human-Object Interaction Detection from Tubelet Tokens", NeurIPS, 2022 (Shanghai Jiao Tong). [Paper]
  • ?: "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • ?: "Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning", ICLR, 2023 (KU Leuven). [Paper]

[Back to Overview]

Salient Object Detection

  • VST: "Visual Saliency Transformer", ICCV, 2021 (Northwestern Polytechnical University). [Paper]
  • ?: "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 (Baidu). [Paper]
  • SwinNet: "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 (Anhui University). [Paper][Code]
  • SOD-Transformer: "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
  • GLSTR: "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 (South China University of Technology). [Paper]
  • TriTransNet: "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 (Anhui University). [Paper]
  • AbiU-Net: "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 (Nankai University). [Paper]
  • TranSalNet: "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 (Cardiff University, UK). [Paper]
  • DFTR: "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 (Tencent). [Paper]
  • GroupTransNet: "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 (Nankai university). [Paper]
  • SelfReformer: "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 (NTU, Singapore). [Paper]
  • DTMINet: "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 (CUHK). [Paper]
  • MCNet: "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • SiaTrans: "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 (Shandong University of Science and Technology). [Paper]
  • PSFormer: "PSFormer: Point Transformer for 3D Salient Object Detection", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper]

[Back to Overview]

Other Detection Tasks

  • X-supervised:
    • LOST: "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 (Valeo.ai). [Paper][PyTorch]
    • Omni-DETR: "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 (Amazon). [Paper][PyTorch]
    • TokenCut: "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
    • WS-DETR: "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 (Microsoft). [Paper]
    • TRT: "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • TokenCut: "TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut", arXiv, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
  • X-Shot Object Detection:
    • AIT: "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 (Academia Sinica). [Paper]
    • Meta-DETR: "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 (NTU Singapore). [Paper][PyTorch]
    • CAT: "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
    • FCT: "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 (Columbia). [Paper]
    • SaFT: "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 (Microsoft). [Paper]
    • TENET: "Time-rEversed diffusioN tEnsor Transformer: A New TENET of Few-Shot Object Detection", ECCV, 2022 (ANU). [Paper][PyTorch]
    • Meta-DETR: "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 (NTU, Singapore). [Paper]
    • Incremental-DETR: "Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning", arXiv, 2022 (NUS). [Paper]
    • FS-DETR: "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", arXiv, 2022 (Samsung). [Paper]
  • Open-World/Vocabulary:
    • OW-DETR: "OW-DETR: Open-world Detection Transformer", CVPR, 2022 (IIAI). [Paper][PyTorch]
    • DetPro: "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • PromptDet: "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 (Meituan). [Paper][PyTorch][Website]
    • OV-DETR: "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 (NTU, Singapore). [Paper]
    • VL-PLM: "Exploiting Unlabeled Data with Vision and Language Models for Object Detection", ECCV, 2022 (Rutgers University). [Paper][PyTorch][Website]
    • DetCLIP: "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection", NeurIPS, 2022 (HKUST). [Paper]
    • WWbL: "What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs", NeurIPS, 2022 (Tel-Aviv). [Paper][PyTorch][Demo]
    • P3OVD: "P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection", arXiv, 2022 (Sun Yat-sen University). [Paper]
    • OVAD: "Open-vocabulary Attribute Detection", arXiv, 2022 (University of Freiburg, Germany). [Paper][Website]
    • Open-World-DETR: "Open World DETR: Transformer based Open World Object Detection", arXiv, 2022 (NUS). [Paper]
  • Pedestrian Detection:
    • PED: "DETR for Crowd Pedestrian Detection", arXiv, 2020 (Tsinghua). [Paper][PyTorch]
    • ?: "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection", NeurIPS, 2022 (ICL). [Paper]
    • Pedestron: "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 (IIAI). [Paper][PyTorch]
  • Lane Detection:
    • LSTR: "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 (Xi'an Jiaotong). [Paper][PyTorch]
    • LETR: "Line Segment Detection Using Transformers without Edges", CVPR, 2021 (UCSD). [Paper][PyTorch]
    • Laneformer: "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 (Huawei). [Paper]
    • TLC: "Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 (Peking University). [Paper]
    • PersFormer: "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 (Shanghai AI Laboratory). [Paper][PyTorch]
    • MHVA: "Lane Detection Transformer Based on Multi-Frame Horizontal and Vertical Attention and Visual Transformer Module", ECCV, 2022 (Beihang University). [Paper]
    • PriorLane: "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 (Zhejiang Lab). [Paper][PyTorch]
    • CurveFormer: "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention", arXiv, 2022 (NullMax, China). [Paper]
  • Object Localization:
    • TS-CAM: "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 (CAS). [Paper]
    • LCTR: "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 (Xiamen University). [Paper]
    • ViTOL: "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 (Mercedes-Benz). [Paper][PyTorch]
    • SCM: "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 (CUHK). [Paper][PyTorch]
    • CaFT: "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper]
  • Relation Detection:
    • PST: "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 (Amazon). [Paper]
    • PST: "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 (Amazon). [Paper]
    • TROI: "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 (NUS, Singapore). [Paper]
    • RelTransformer: "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 (KAUST). [Paper][PyTorch]
    • VReBERT: "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 (ANU). [Paper]
  • Anomaly Detection:
    • VT-ADL: "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 (University of Udine, Italy). [Paper]
    • InTra: "Inpainting Transformer for Anomaly Detection", arXiv, 2021 (Fujitsu). [Paper]
    • AnoViT: "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 (Korea University). [Paper]
  • Cross-Domain:
    • SSTN: "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
    • DA-DETR: "DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention", arXiv, 2021 (NTU Singapore). [Paper]
    • MTTrans: "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 (Beihang University). [Paper]
    • OAA-OTA: "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 (Beijing Institute of Technology). [Paper]
    • SSTA: "Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • Co-Salient Object Detection:
    • CoSformer: "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 (Nanjing University). [Paper]
  • Oriented Object Detection:
    • O2DETR: "Oriented Object Detection with Transformer", arXiv, 2021 (Baidu). [Paper]
  • Multiview Detection:
    • MVDeTr: "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 (ANU). [Paper]
  • Polygon Detection:
    • ?: "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 (Delft University of Technology, Netherlands). [Paper]
  • Drone-view:
    • TPH: "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", ICCVW, 2021 (Beihang University). [Paper]
    • TransVisDrone: "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 (UCF). [Paper][Code (in construction)]
  • Infrared:
    • ?: "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 (Chongqing University of Posts and Telecommunications). [Paper]
  • Text Detection:
    • SwinTextSpotter: "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
    • TESTR: "Text Spotting Transformers", CVPR, 2022 (UCSD). [Paper][PyTorch]
    • TTS: "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 (Amazon). [Paper]
    • oCLIP: "Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting", ECCV, 2022 (ByteDance). [Paper]
    • TransDETR: "End-to-End Video Text Spotting with Transformer", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • ?: "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 (University of Waterloo, Canada). [Paper]
    • ?: "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 (University of Science and Technology Beijing). [Paper][Code (in construction)]
    • DPText-DETR: "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", arXiv, 2022 (JD). [Paper][Code (in construction)]
    • DPTNet: "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 (Xiamen University). [Paper]
    • ATTR: "Aggregated Text Transformer for Scene Text Detection", arXiv, 2022 (Fudan). [Paper]
  • Change Detection:
    • ChangeFormer: "A Transformer-Based Siamese Network for Change Detection", arXiv, 2022 (JHU). [Paper][PyTorch]
    • IDET: "IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection", arXiv, 2022 (Civil Aviation University of China). [Paper]
  • Edge Detection:
    • EDTER: "EDTER: Edge Detection with Transformer", CVPR, 2022 (Beijing Jiaotong University). [Paper][Code (in construction)]
    • HEAT: "HEAT: Holistic Edge Attention Transformer for Structured Reconstruction", CVPR, 2022 (Simon Fraser). [Paper][PyTorch][Website]
  • Person Search:
    • COAT: "Cascade Transformers for End-to-End Person Search", CVPR, 2022 (Kitware). [Paper][PyTorch]
    • PSTR: "PSTR: End-to-End One-Step Person Search With Transformers", CVPR, 2022 (Tianjin University). [Paper][PyTorch]
  • Manipulation Detection:
    • ObjectFormer: "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 (Fudan University). [Paper]
  • Mirror Detection:
    • SATNet: "Symmetry-Aware Transformer-based Mirror Detection", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • Shadow Detection:
    • SCOTCH-SODA: "SCOTCH and SODA: A Transformer Video Shadow Detection Framework", arXiv, 2022 (University of Cambridge). [Paper]

[Back to Overview]

Segmentation

Semantic Segmentation

  • SETR: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch][Website]
  • TrSeg: "TrSeg: Transformer for semantic segmentation", PRL, 2021 (Korea University). [Paper][PyTorch]
  • CWT: "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • Segmenter: "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 (INRIA). [Paper][PyTorch]
  • UN-EPT: "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 (Amazon). [Paper][PyTorch]
  • SegFormer: "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • FTN: "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 (Baidu). [Paper]
  • OffRoadTranSeg: "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments", arXiv, 2021 (IISER, India). [Paper]
  • MaskFormer: "Per-Pixel Classification is Not All You Need for Semantic Segmentation", arXiv, 2021 (UIUC + Facebook). [Paper][Website]
  • TRFS: "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 (ETHZ). [Paper]
  • Flying-Guide-Dog: "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation", arXiv, 2021 (KIT, Germany). [Paper][Code (in construction)]
  • VSPW: "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 (Xiaomi). [Paper]
  • SDTP: "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 (?). [Paper]
  • TopFormer: "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • GroupViT: "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 (NVIDIA). [Paper][Website][PyTorch]
  • HRViT: "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch]
  • GReaT: "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 (HKUST). [Paper]
  • SegDeformer: "A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining", ECCV, 2022 (Shanghai Jiao Tong + Huawei). [Paper][PyTorch]
  • PAUMER: "PAUMER: Patch Pausing Transformer for Semantic Segmentation", BMVC, 2022 (Idiap, Switzerland). [Paper]
  • SegViT: "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 (The University of Adelaide, Australia). [Paper]
  • RTFormer: "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 (Baidu). [Paper][Paddle]
  • SegNeXt: "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation", NeurIPS, 2022 (Tsinghua University). [Paper]
  • Lawin: "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • PFT: "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 (CUHK + SenseTime). [Paper]
  • DFlatFormer: "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 (OPPO). [Paper]
  • FeSeFormer: "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 (Baidu). [Paper]
  • StructToken: "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 (Shanghai AI Lab). [Paper]
  • TSG: "Transformer Scale Gate for Semantic Segmentation", arXiv, 2022 (Monash University, Australia). [Paper]
  • HILA: "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 (University of Toronto). [Paper][Website][PyTorch]
  • HLG: "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 (Fudan University). [Paper][PyTorch]
  • SSformer: "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • NamedMask: "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 (Oxford). [Paper][PyTorch][Website]
  • IncepFormer: "IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • SeaFormer: "SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation", ICLR, 2023 (Tencent). [Paper]
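
A large fraction of the segmenters above reduce to the same skeleton: a plain Transformer encoder over patch tokens plus a lightweight decoder that classifies each token and upsamples back to pixels (the SETR/Segmenter-style linear decoding idea). The sketch below shows that skeleton only; positional embeddings are omitted and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniSegViT(nn.Module):
    def __init__(self, num_classes=21, dim=384, patch=16, depth=4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patchify
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.Linear(dim, num_classes)                     # linear per-token decoder

    def forward(self, x):                                # x: (batch, 3, H, W)
        b, _, H, W = x.shape
        tokens = self.embed(x).flatten(2).transpose(1, 2)                 # (batch, hw, dim)
        tokens = self.encoder(tokens)                                     # positional embeddings omitted
        logits = self.classifier(tokens)                                  # per-patch class scores
        logits = logits.transpose(1, 2).reshape(b, -1, H // self.patch, W // self.patch)
        return F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)

out = MiniSegViT()(torch.randn(1, 3, 224, 224))          # (1, 21, 224, 224)
```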

[Back to Overview]

Depth Estimation

  • DPT: "Vision Transformers for Dense Prediction", ICCV, 2021 (Intel). [Paper][PyTorch]
  • TransDepth: "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction", ICCV, 2021 (Harbin Institute of Technology + University of Trento). [Paper][PyTorch]
  • ASTransformer: "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 (USTC). [Paper][PyTorch]
  • MT-SfMLearner: "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
  • DepthFormer: "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 (Toyota). [Paper]
  • GuideFormer: "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 (Agency for Defense Development, Korea). [Paper]
  • SparseFormer: "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 (Meta). [Paper]
  • DEST: "Depth Estimation with Simplified Transformer", CVPRW, 2022 (NVIDIA). [Paper]
  • MonoViT: "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 (University of Bologna, Italy). [Paper][PyTorch]
  • Spike-Transformer: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • ?: "Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation", ECCVW, 2022 (IIT Madras). [Paper]
  • GLPanoDepth: "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 (Nanjing University). [Paper]
  • DepthFormer: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • BinsFormer: "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • SideRT: "SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 (Meituan). [Paper]
  • MonoFormer: "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 (DGIST, Korea). [Paper]
  • Depthformer: "Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion", arXiv, 2022 (Indian Institute of Technology Delhi). [Paper]
  • TODE-Trans: "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 (USTC). [Paper][Code (in construction)]
  • Lite-Mono: "Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation", arXiv, 2022 (University of Twente, Netherlands). [Paper][PyTorch (in construction)]
  • ObjCAViT: "ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention", arXiv, 2022 (ICL). [Paper]
  • ROIFormer: "ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation", AAAI, 2023 (OPPO). [Paper]

[Back to Overview]

Object Segmentation

  • SOTR: "SOTR: Segmenting Objects with Transformers", ICCV, 2021 (China Agricultural University). [Paper][PyTorch]
  • Trans4Trans: "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
  • Trans2Seg: "Segmenting Transparent Object in the Wild with Transformer", arXiv, 2021 (HKU + SenseTime). [Paper][PyTorch]
  • SOIT: "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 (Hikvision). [Paper][PyTorch]
  • CAST: "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 (Berkeley). [Paper]
  • ?: "Learning Explicit Object-Centric Representations with Vision Transformers", arXiv, 2022 (Aalto University, Finland). [Paper]
  • MSMFormer: "Mean Shift Mask Transformer for Unseen Object Instance Segmentation", arXiv, 2022 (UT Dallas). [Paper][PyTorch]

[Back to Overview]

Other Segmentation Tasks

  • Vision-Language:
    • LSeg: "Language-driven Semantic Segmentation", ICLR, 2022 (Cornell). [Paper][PyTorch]
    • ZegFormer: "Decoupling Zero-Shot Semantic Segmentation", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
    • CLIPSeg: "Image Segmentation Using Text and Image Prompts", CVPR, 2022 (University of Göttingen, Germany). [Paper][PyTorch]
    • DenseCLIP: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting", CVPR, 2022 (Tsinghua University). [Paper][PyTorch][Website]
    • MaskCLIP: "Extract Free Dense Labels from CLIP", ECCV, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • ZegCLIP: "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation", arXiv, 2022 (The University of Adelaide, Australia). [Paper][PyTorch (in construction)]
  • Multi-Modal:
    • UCTNet: "UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation", ECCV, 2022 (Lehigh University, Pennsylvania). [Paper]
    • CMX: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
  • Panoptic Segmentation:
    • MaX-DeepLab: "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 (Google). [Paper][PyTorch (conradry)]
    • SIAin: "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 (SI Analytics, South Korea). [Paper]
    • VPS-Transformer: "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
    • CMT-DeepLab: "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 (Google). [Paper]
    • Panoptic-SegFormer: "Panoptic SegFormer", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
    • Mask2Former: "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch][Website]
    • kMaX-DeepLab: "k-means Mask Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
    • Panoptic-PartFormer: "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 (Peking). [Paper][PyTorch]
    • CoMFormer: "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation", arXiv, 2022 (Sorbonne Université, France). [Paper]
  • Instance Segmentation:
    • ISTR: "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 (Xiamen University). [Paper][PyTorch]
    • Mask-Transfiner: "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch][Website]
    • BoundaryFormer: "Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 (UCSD). [Paper]
    • PPT: "Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation", CVPRW, 2022 (ByteDance). [Paper]
    • OSFormer: "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • AISFormer: "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 (University of Arkansas, Arkansas). [Paper][PyTorch]
    • TOIST: "TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
  • Universal Segmentation:
    • OneFormer: "OneFormer: One Transformer to Rule Universal Image Segmentation", arXiv, 2022 (Oregon). [Paper][PyTorch][Website]
  • Optical Flow:
    • CRAFT: "CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow", CVPR, 2022 (A*STAR, Singapore). [Paper][PyTorch]
    • KPA-Flow: "Learning Optical Flow With Kernel Patch Attention", CVPR, 2022 (Megvii). [Paper][PyTorch (in construction)]
    • GMFlowNet: "Global Matching with Overlapping Attention for Optical Flow Estimation", CVPR, 2022 (Rutgers). [Paper][PyTorch]
    • FlowFormer: "FlowFormer: A Transformer Architecture for Optical Flow", ECCV, 2022 (CUHK). [Paper][Website]
  • Panoramic Semantic Segmentation:
    • Trans4PASS: "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
  • X-Shot:
    • CyCTR: "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 (University of Technology Sydney). [Paper]
    • CATrans: "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 (Baidu). [Paper]
    • VAT: "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 (Korea University). [Paper][PyTorch][Website]
    • DCAMA: "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 (Tencent). [Paper]
    • AAFormer: "Adaptive Agent Transformer for Few-Shot Segmentation", ECCV, 2022 (USTC). [Paper]
    • IPMT: "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 (Northwestern Polytechnical University). [Paper][PyTorch]
    • TAFT: "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 (KAIST). [Paper]
    • MSANet: "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 (AiV Research Group, Korea). [Paper][PyTorch]
    • MuHS: "Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation", ICLR, 2023 (Zhejiang University). [Paper]
    • VTM: "Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching", ICLR, 2023 (KAIST). [Paper]
  • X-Supervised:
    • MCTformer: "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 (The University of Western Australia). [Paper][Code (in construction)]
    • AFA: "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
    • HSG: "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 (Berkeley). [Paper][PyTorch]
    • ?: "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 (Université Paris-Saclay, France). [Paper]
    • SegSwap: "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 (École des Ponts ParisTech). [Paper][PyTorch][Website]
    • ViT-PCM: "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 (Sapienza University, Italy). [Paper][Tensorflow]
    • TransFGU: "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • TransCAM: "TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 (University of Toronto). [Paper][PyTorch]
    • WegFormer: "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Tongji University, China). [Paper]
    • MaskDistill: "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 (KU Leuven). [Paper][PyTorch]
    • eX-ViT: "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (La Trobe University, Australia). [Paper]
    • TCC: "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 (Alibaba). [Paper]
    • SemFormer: "SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Shenzhen University). [Paper][PyTorch]
    • CLIP-ES: "CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation", arXiv, 2022 (?). [Paper][PyTorch]
  • Cross-Domain:
    • DAFormer: "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch]
    • ?: "Exploring Consistency in Cross-Domain Transformer for Domain Adaptive Semantic Segmentation", arXiv, 2022 (Boston). [Paper]
  • Continual Learning:
    • TISS: "Delving into Transformer for Incremental Semantic Segmentation", arXiv, 2022 (Tencent). [Paper]
  • Crack Detection:
    • CrackFormer: "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 (Nanjing University of Science and Technology). [Paper]
  • Camouflaged Object Detection:
    • UGTR: "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 (Group42, Abu Dhabi). [Paper][PyTorch]
    • COD: "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 (Anhui University, China). [Paper][Code (in construction)]
  • Background Separation:
    • TransBlast: "TransBlast: Self-Supervised Learning Using Augmented Subspace With Transformer for Background/Foreground Separation", ICCVW, 2021 (University of British Columbia). [Paper]
  • Scene Understanding:
    • BANet: "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 (Wuhan University). [Paper]
    • Cerberus-Transformer: "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • IRISformer: "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 (UCSD). [Paper][Code (in construction)]
    • InvPT: "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 (HKUST). [Paper][PyTorch]
    • TaskPrompter: "TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding", ICLR, 2023 (HKUST). [Paper][PyTorch (in construction)]
  • 3D Segmentation:
    • Stratified-Transformer: "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • CodedVTR: "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 (Tsinghua). [Paper]
    • M2F3D: "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper][Website]
    • 3DSeg: "3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation", ICLR, 2023 (The University of Tokyo). [Paper]
    • Analogical-Network: "Analogical Networks for Memory-Modulated 3D Parsing", ICLR, 2023 (CMU). [Paper]
  • Multi-Task:
    • MTFormer: "MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning", ECCV, 2022 (CUHK). [Paper]
    • MQTransformer: "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 (Wuhan University). [Paper]
  • Forecasting:
    • DiffAttn: "Joint Forecasting of Panoptic Segmentations with Difference Attention", CVPR, 2022 (UIUC). [Paper][Code (in construction)]
  • LiDAR:
    • HelixNet: "Online Segmentation of LiDAR Sequences: Dataset and Algorithm", CVPRW, 2022 (CNRS, France). [Paper][Website][PyTorch]
    • Gaussian-Radar-Transformer: "Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data", RA-L, 2022 (University of Bonn, Germany). [Paper]
  • Co-Segmentation:
    • DINO-ViT-feature: "Deep ViT Features as Dense Visual Descriptors", arXiv, 2022 (Weizmann Institute of Science, Israel). [Paper][PyTorch][Website]
  • Top-Down Semantic Segmentation:
    • Trans4Map: "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
  • Open-World/Vocabulary:
    • ViL-Seg: "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 (CUHK). [Paper]
    • ?: "A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • Fusioner: "Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models", BMVC, 2022 (Shanghai Jiao Tong University). [Paper][Website]
    • OVSeg: "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", arXiv, 2022 (Meta). [Paper][Website]
    • SegCLIP: "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation", arXiv, 2022 (JD). [Paper][Code (in construction)]
    • TCL: "Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs", arXiv, 2022 (Kakao). [Paper][Code (in construction)]
    • PACL: "Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning", arXiv, 2022 (Meta). [Paper]
  • Surface Normal:
    • Normal-Transformer: "Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics", arXiv, 2022 (University of Technology Sydney). [Paper]
  • Applications:
    • FloodTransformer: "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 (BITS Pilani, India). [Paper]

[Back to Overview]

Video (High-level)

Action Recognition

  • RGB mainly:
    • Action Transformer: "Video Action Transformer Network", CVPR, 2019 (DeepMind). [Paper][Code (ppriyank)]
    • ViViT-Ensemble: "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 (Alibaba). [Paper]
    • TimeSformer: "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 (Facebook). [Paper][PyTorch (lucidrains)]
    • MViT: "Multiscale Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
    • VidTr: "VidTr: Video Transformer Without Convolutions", ICCV, 2021 (Amazon). [Paper][PyTorch]
    • ViViT: "ViViT: A Video Vision Transformer", ICCV, 2021 (Google). [Paper][PyTorch (rishikksh20)]
    • VTN: "Video Transformer Network", ICCVW, 2021 (Theator). [Paper][PyTorch]
    • TokShift: "Token Shift Transformer for Video Classification", ACMMM, 2021 (CUHK). [Paper][PyTorch]
    • Motionformer: "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 (Facebook). [Paper][PyTorch][Website]
    • X-ViT: "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 (Samsung). [Paper][PyTorch]
    • SCT: "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 (Kuaishou). [Paper]
    • RSANet: "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 (POSTECH). [Paper][PyTorch][Website]
    • STAM: "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 (Alibaba). [Paper][Code]
    • GAT: "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 (Samsung). [Paper]
    • TokenLearner: "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 (Google). [Paper]
    • VLF: "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 (The University of Sheffield). [Paper]
    • UniFormer: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 (CAS + SenseTime). [Paper][PyTorch]
    • Video-Swin: "Video Swin Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • DirecFormer: "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 (University of Arkansas). [Paper][Code (in construction)]
    • DVT: "Deformable Video Transformer", CVPR, 2022 (Meta). [Paper]
    • MeMViT: "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 (Meta). [Paper]
    • MLP-3D: "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 (JD). [Paper][PyTorch (in construction)]
    • RViT: "Recurring the Transformer for Video Action Recognition", CVPR, 2022 (TCL Corporate Research, HK). [Paper]
    • SIFA: "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 (JD). [Paper][PyTorch]
    • MViTv2: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 (Meta). [Paper][PyTorch]
    • MTV: "Multiview Transformers for Video Recognition", CVPR, 2022 (Google). [Paper][Tensorflow]
    • ORViT: "Object-Region Video Transformers", CVPR, 2022 (Tel Aviv). [Paper][Website]
    • TIME: "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 (KAIST). [Paper][PyTorch]
    • TPS: "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • DualFormer: "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
    • STTS: "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 (Fudan University). [Paper][PyTorch]
    • Turbo: "Turbo Training with Token Dropout", BMVC, 2022 (Oxford). [Paper]
    • MultiTrain: "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 (Tencent). [Paper][Code (in construction)]
    • SViT: "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", NeurIPS, 2022 (Tel Aviv). [Paper][Website]
    • ST-Adapter: "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning", NeurIPS, 2022 (CUHK). [Paper][Code (in construction)]
    • ATA: "Alignment-guided Temporal Attention for Video Action Recognition", NeurIPS, 2022 (Microsoft). [Paper]
    • AIA: "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 (University of Science and Technology of China). [Paper][PyTorch]
    • MSCA: "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 (Nagoya Institute of Technology). [Paper]
    • VAST: "Efficient Attention-free Video Shift Transformers", arXiv, 2022 (Samsung). [Paper]
    • Video-MobileFormer: "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 (Microsoft). [Paper]
    • MAM2: "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 (Baidu). [Paper]
    • ?: "Linear Video Transformer with Feature Fixation", arXiv, 2022 (SenseTime). [Paper]
    • STAN: "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 (The University of Surrey, UK). [Paper]
    • AdaMAE: "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders", arXiv, 2022 (JHU). [Paper][PyTorch]
    • UniFormerV2: "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer", arXiv, 2022 (CAS). [Paper][PyTorch]
    • PatchBlender: "PatchBlender: A Motion Prior for Video Transformers", arXiv, 2022 (Mila). [Paper]
    • TubeViT: "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning", arXiv, 2022 (Google). [Paper]
  • Depth:
    • Trear: "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 (Tianjin University). [Paper]
  • Pose/Skeleton:
    • ST-TR: "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 (Polytechnic University of Milan). [Paper]
    • AcT: "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 (Politecnico di Torino, Italy). [Paper][Code (in construction)]
    • STAR: "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 (UCLA). [Paper]
    • GCsT: "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 (CAS). [Paper]
    • GL-Transformer: "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 (Seoul National University). [Paper][PyTorch]
    • ?: "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 (University of Delaware). [Paper]
    • FG-STFormer: "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 (Zhengzhou University). [Paper]
    • STTFormer: "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 (Xidian University). [Paper][Code (in construction)]
    • ProFormer: "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
    • ?: "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 (Harbin Institute of Technology). [Paper]
    • HyperSA: "Hypergraph Transformer for Skeleton-based Action Recognition", arXiv, 2022 (University of Mannheim, Germany). [Paper]
    • STAR-Transformer: "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 (Keimyung University, Korea). [Paper]
  • Multi-modal:
    • MBT: "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 (Google). [Paper]
    • MM-ViT: "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 (OPPO). [Paper]
    • MMT-NCRC: "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 (UCF). [Paper][Code (in construction)]
    • M&M: "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 (Google). [Paper]
    • VT-CE: "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 (A*STAR). [Paper]
    • Hi-TRS: "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 (Rutgers). [Paper][PyTorch]
    • MVFT: "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 (Alibaba). [Paper]
    • MOV: "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 (Google). [Paper]
    • MotionBERT: "MotionBERT: Unified Pretraining for Human Motion Analysis", arXiv, 2022 (Peking University). [Paper][Code (in construction)][Website]
    • HIT: "Holistic Interaction Transformer Network for Action Detection", WACV, 2023 (NTHU). [Paper][Code (in construction)]
  • Group Activity:
    • GroupFormer: "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 (SenseTime). [Paper]
    • ?: "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 (Hitachi). [Paper]

[Back to Overview]

Action Detection/Localization

  • OadTR: "OadTR: Online Action Detection with Transformers", ICCV, 2021 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • RTD-Net: "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • FS-TAL: "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • LSTR: "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 (Amazon). [Paper][PyTorch][Website]
  • ATAG: "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 (Alibaba). [Paper]
  • TAPG-Transformer: "Temporal Action Proposal Generation with Transformers", arXiv, 2021 (Harbin Institute of Technology). [Paper]
  • TadTR: "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 (Alibaba). [Paper][Code (in construction)]
  • Vidpress-Soccer: "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 (Baidu). [Paper][GitHub]
  • MS-TCT: "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 (INRIA). [Paper][PyTorch]
  • UGPT: "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
  • TubeR: "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 (Amazon). [Paper]
  • DDM-Net: "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
  • ?: "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 (ByteDance). [Paper][PyTorch]
  • ?: "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 (Renmin University of China). [Paper]
  • EAMAT: "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 (Beijing Institute of Technology). [Paper][Code (in construction)]
  • STPT: "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 (Monash University, Australia). [Paper]
  • TeSTra: "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 (UT Austin). [Paper][PyTorch]
  • TALLFormer: "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 (UNC). [Paper][PyTorch]
  • ?: "Uncertainty-Based Spatial-Temporal Attention for Online Action Detection", ECCV, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
  • ActionFormer: "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 (UW-Madison). [Paper][PyTorch]
  • ActionFormer: "Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge", ECCVW, 2022 (UW-Madison). [Paper][Pytorch]
  • CoOadTR: "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 (Aarhus University, Denmark). [Paper][PyTorch]
  • Temporal-Perceiver: "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 (Nanjing University). [Paper]
  • LocATe: "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 (Stanford). [Paper]
  • HTNet: "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 (Korea University). [Paper]
  • AdaPerFormer: "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 (Tianjin University). [Paper][PyTorch]
  • CWC-Trans: "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 (Meituan). [Paper]
  • MUPPET: "Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation", arXiv, 2022 (Meta). [Paper][Code (in construction)]

[Back to Overview]

Action Prediction/Anticipation

  • AVT: "Anticipative Video Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
  • TTPP: "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", Neurocomputing, 2021 (CAS). [Paper]
  • HORST: "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 (NVIDIA). [Paper][PyTorch]
  • ?: "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 (A*STAR). [Paper]
  • FUTR: "Future Transformer for Long-term Action Anticipation", CVPR, 2022 (POSTECH). [Paper]
  • VPTR: "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 (Polytechnique Montreal, Canada). [Paper][PyTorch]
  • Earthformer: "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", NeurIPS, 2022 (Amazon). [Paper]
  • InAViT: "Interaction Visual Transformer for Egocentric Action Anticipation", arXiv, 2022 (A*STAR). [Paper]
  • VPTR: "Video Prediction by Efficient Transformers", IVC, 2022 (Polytechnique Montreal, Canada). [Paper][Pytorch]
  • AFFT: "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation", WACV, 2023 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
  • GliTr: "GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction", WACV, 2023 (McGill University, Canada). [Paper]

[Back to Overview]

Video Object Segmentation

  • GC: "Fast Video Object Segmentation using the Global Context Module", ECCV, 2020 (Tencent). [Paper]
  • SSTVOS: "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 (Modiface). [Paper][Code (in construction)]
  • JOINT: "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 (University of Science and Technology of China). [Paper][PyTorch]
  • AOT: "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 (University of Technology Sydney). [Paper][PyTorch (yoxu515)][Code (in construction)]
  • TransVOS: "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 (Zhejiang University). [Paper]
  • SITVOS: "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 (JD). [Paper]
  • MTTR: "End-to-End Referring Video Object Segmentation with Multimodal Transformers", CVPR, 2022 (Technion - Israel Institute of Technology). [Paper][PyTorch]
  • HODOR: "Differentiable Soft-Masked Attention", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper]
  • BATMAN: "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 (Microsoft). [Paper]
  • AOT: "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 (University of Technology Sydney). [Paper][Code (in construction)]

[Back to Overview]

Video Instance Segmentation

  • VisTR: "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 (Meituan). [Paper][PyTorch]
  • IFC: "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 (Yonsei University). [Paper][PyTorch]
  • Deformable-VisTR: "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 (University at Buffalo). [Paper][Code (in construction)]
  • TeViT: "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • GMP-VIS: "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 (Shandong University). [Paper]
  • VMT: "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 (ETHZ). [Paper][GitHub][Website]
  • SeqFormer: "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 (ByteDance). [Paper][PyTorch]
  • MS-STS: "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • MinVIS: "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", NeurIPS, 2022 (NVIDIA). [Paper][PyTorch]
  • VITA: "VITA: Video Instance Segmentation via Object Token Association", NeurIPS, 2022 (Yonsei University). [Paper][PyTorch]
  • IFR: "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 (Microsoft). [Paper]
  • DeVIS: "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 (TUM). [Paper][PyTorch]
  • InstanceFormer: "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 (Ludwig Maximilian University of Munich). [Paper][Code (in construction)]

[Back to Overview]

Other Video Tasks

  • Action Segmentation:
    • ASFormer: "ASFormer: Transformer for Action Segmentation", BMVC, 2021 (Peking University). [Paper][PyTorch]
    • Bridge-Prompt: "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • SC-Transformer++: "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 (CAS). [Paper][Code (in construction)]
    • UVAST: "Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation", ECCV, 2022 (Bosch). [Paper][PyTorch]
    • ?: "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 (TUM). [Paper]
    • CETNet: "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 (Shijiazhuang Tiedao University). [Paper]
    • EUT: "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 (CAS). [Paper]
    • SC-Transformer: "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 (CAS). [Paper]
  • Video X Segmentation:
    • STT: "Video Semantic Segmentation via Sparse Temporal Transformer", ACMMM, 2021 (Shanghai Jiao Tong). [Paper]
    • CFFM: "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 (ETH Zurich). [Paper][PyTorch]
    • TF-DL: "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 (Google). [Paper]
    • MRCFA: "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 (ETH Zurich). [Paper][PyTorch]
    • PolyphonicFormer: "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 (Wuhan University). [Paper][Code (in construction)]
    • ?: "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
  • Video Object Detection:
    • TransVOD: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 (Shanghai Jiao Tong + SenseTime). [Paper][Code (in construction)]
    • MODETR: "MODETR: Moving Object Detection with Transformers", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-MTL: "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-DETR: "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 (Valeo, Egypt). [Paper]
    • PTSEFormer: "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
    • TransVOD: "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 (Shanghai Jiao Tong + SenseTime). [Paper]
    • ?: "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 (Zenseact, Sweden). [Paper]
  • Dense Video Tasks (Detection + Segmentation):
    • TDViT: "TDViT: Temporal Dilated Video Transformer for Dense Video Tasks", ECCV, 2022 (Queen's University Belfast, UK). [Paper][Code (in construction)]
  • Video Retrieval:
    • SVRTN: "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 (Alibaba). [Paper]
  • Video Hashing:
    • BTH: "Self-Supervised Video Hashing via Bidirectional Transformers", CVPR, 2021 (Tsinghua). [Paper][PyTorch]
  • Video-Language:
    • ActionCLIP: "ActionCLIP: A New Paradigm for Video Action Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • ?: "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 (Shanghai Jiao Tong + Oxford). [Paper][PyTorch][Website]
    • X-CLIP: "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • EVL: "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 (CUHK). [Paper][PyTorch (in construction)]
    • STALE: "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 (University of Surrey, UK). [Paper][Code (in construction)]
    • ?: "Knowledge Prompting for Few-shot Action Recognition", arXiv, 2022 (Beijing Laboratory of Intelligent Information Technology). [Paper]
    • VLG: "VLG: General Video Recognition with Web Textual Knowledge", arXiv, 2022 (Nanjing University). [Paper]
    • InternVideo: "InternVideo: General Video Foundation Models via Generative and Discriminative Learning", arXiv, 2022 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
    • ViFi-CLIP: "Fine-tuned CLIP Models are Efficient Video Learners", arXiv, 2022 (MBZUAI). [Paper]
    • LaViLa: "Learning Video Representations from Large Language Models", arXiv, 2022 (Meta). [Paper][PyTorch][Website]
    • PromptonomyViT: "PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data", arXiv, 2022 (Tel Aviv + IBM). [Paper]
    • MovieCLIP: "MovieCLIP: Visual Scene Recognition in Movies", WACV, 2023 (USC). [Paper][Website]
    • AIM: "AIM: Adapting Image Models for Efficient Video Action Recognition", ICLR, 2023 (Amazon). [Paper][PyTorch][Website]
  • X-supervised Learning:
    • LSTCL: "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 (Facebook). [Paper]
    • SVT: "Self-supervised Video Transformer", CVPR, 2022 (Stony Brook). [Paper][PyTorch][Website]
    • BEVT: "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • SCVRL: "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 (Amazon). [Paper]
    • VideoMAE: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", CVPRW, 2022 (Tencent). [Paper][Code (in construction)]
    • VIMPAC: "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 (UNC). [Paper][PyTorch]
    • ?: "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 (CUHK). [Paper]
    • MAE: "Masked Autoencoders As Spatiotemporal Learners", arXiv, 2022 (Meta). [Paper]
    • OmniMAE: "OmniMAE: Single Model Masked Pretraining on Images and Videos", arXiv, 2022 (Meta). [Paper][PyTorch]
    • ?: "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 (Georgia Tech). [Paper]
    • SVFormer: "SVFormer: Semi-supervised Video Transformer for Action Recognition", arXiv, 2022 (Fudan University). [Paper][PyTorch]
    • MVD: "Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning", arXiv, 2022 (Fudan University). [Paper][Code (in construction)]
    • MaskViT: "MaskViT: Masked Visual Pre-Training for Video Prediction", ICLR, 2023 (Stanford). [Paper][Code (in construction)][Website]
  • X-shot:
    • ResT: "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 (Microsoft). [Paper]
    • ViSET: "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 (University of South Florida). [Paper]
    • REST: "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 (Samsung). [Paper]
  • Anomaly Detection:
    • CT-D2GAN: "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 (NEC). [Paper]
    • ADTR: "ADTR: Anomaly Detection Transformer with Feature Reconstruction", International Conference on Neural Information Processing (ICONIP), 2022 (Shanghai Jiao Tong University). [Paper]
    • SSMCTB: "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 (UCF). [Paper][Code (in construction)]
    • ?: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 (Korea University). [Paper]
    • CLIP-TSA: "CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection", arXiv, 2022 (University of Arkansas). [Paper]
  • Relation Detection:
    • VidVRD: "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 (Zhejiang University). [Paper][PyTorch]
    • VRDFormer: "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 (Renmin University of China). [Paper][Code (in construction)]
    • VidSGG-BIG: "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
    • RePro: "Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection", ICLR, 2023 (Zhejiang University). [Paper][PyTorch (in construction)]
  • Saliency Prediction:
    • STSANet: "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 (Shanghai University). [Paper]
    • UFO: "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 (South China University of Technology). [Paper][PyTorch]
  • Video Inpainting Detection:
    • FAST: "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 (Tsinghua University). [Paper]
  • Driver Activity:
    • TransDARC: "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
    • ?: "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 (Jericho High School, NY). [Paper]
    • ViT-DD: "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 (Purdue). [Paper][PyTorch (in construction)]
  • Video Alignment:
    • DGWT: "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 (University of New South Wales, Australia). [Paper]
  • Sport-related:
    • Skating-Mixer: "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 (Southern University of Science and Technology). [Paper]
  • Action Counting:
    • TransRAC: "TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting", CVPR, 2022 (ShanghaiTech). [Paper][PyTorch][Website]
  • Action Quality Assessment:
    • ?: "Action Quality Assessment with Temporal Parsing Transformer", ECCV, 2022 (Baidu). [Paper]
    • ?: "Action Quality Assessment using Transformers", arXiv, 2022 (USC). [Paper]
  • Human Interaction:
    • IGFormer: "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 (The University of Melbourne). [Paper]
  • Domain Adaptation:
    • UDAVT: "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 (University of Trento). [Paper][Code (in construction)]
  • Multi-Camera Editing:
    • TC-Transformer: "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 (CUHK). [Paper]

[Back to Overview]

Multi-Modality

Visual Captioning

  • General:
    • ETA-Transformer: "Entangled Transformer for Image Captioning", ICCV, 2019 (UTS). [Paper]
    • M2-Transformer: "Meshed-Memory Transformer for Image Captioning", CVPR, 2020 (UniMoRE). [Paper][PyTorch]
    • MCCFormers: "Describing and Localizing Multiple Changes with Transformers", ICCV, 2021 (AIST). [Paper][Website]
    • SATIC: "Semi-Autoregressive Transformer for Image Captioning", ICCVW, 2021 (Hefei University of Technology). [Paper][PyTorch]
    • DGCN: "Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning", ACMMM, 2021 (Wuhan University). [Paper]
    • CPTR: "CPTR: Full Transformer Network for Image Captioning", arXiv, 2021 (CAS). [Paper]
    • ReFormer: "ReFormer: The Relational Transformer for Image Captioning", arXiv, 2021 (Stony Brook University). [Paper]
    • LAViTeR: "LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation", arXiv, 2021 (University at Buffalo). [Paper]
    • LATGeO: "Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
    • GEVST: "Geometry-Entangled Visual Semantic Transformer for Image Captioning", arXiv, 2021 (NTU, Singapore). [Paper]
    • GAT: "Geometry Attention Transformer with Position-aware LSTMs for Image Captioning", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
    • PureT: "End-to-End Transformer Based Model for Image Captioning", AAAI, 2022 (CAS). [Paper]
    • VisualGPT: "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning", CVPR, 2022 (KAUST). [Paper][PyTorch]
    • ViTCAP: "Injecting Semantic Concepts into End-to-End Image Captioning", CVPR, 2022 (Microsoft). [Paper]
    • CLIP-Event: "CLIP-Event: Connecting Text and Images with Event Structures", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • CLIP4IDC: "CLIP4IDC: CLIP for Image Difference Captioning", CVPRW, 2022 (Aalto University, Finland). [Paper][Code (in construction)]
    • ?: "A Dual-Attentive Approach to Style-Based Image Captioning Using a CNN-Transformer Model", CVPRW, 2022 (The University of the West Indies, Jamaica). [Paper]
    • SpaCap3D: "Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds", IJCAI, 2022 (University of Sydney). [Paper][Code (in construction)][Website]
    • RA-Transformer: "Retrieval-Augmented Transformer for Image Captioning", International Conference on Content-Based Multimedia Indexing (CBMI), 2022 (University of Modena and Reggio Emilia, Italy). [Paper]
    • GRIT: "GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features", ECCV, 2022 (Tohoku University + RIKEN AIP). [Paper][PyTorch]
    • ?: "Object-Centric Unsupervised Image Captioning", ECCV, 2022 (Meta). [Paper][PyTorch]
    • UEDVC: "Unifying Event Detection and Captioning as Sequence Generation via Pre-Training", ECCV, 2022 (Renmin University of China). [Paper][PyTorch]
    • TIger: "Explicit Image Caption Editing", ECCV, 2022 (Zhejiang University). [Paper][Code]
    • DML: "Learning Distinct and Representative Modes for Image Captioning", NeurIPS, 2022 (University of Adelaide, Australia). [Paper]
    • P2C: "Paraphrasing Is All You Need for Novel Object Captioning", NeurIPS, 2022 (NTU + CMU). [Paper]
    • BEST: "Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning", NeurIPS, 2022 (Microsoft). [Paper]
    • CapDec: "Text-Only Training for Image Captioning using Noise-Injected CLIP", EMNLP, 2022 (Tel Aviv). [Paper][PyTorch]
    • ?: "Focus! Relevant and Sufficient Context Selection for News Image Captioning", EMNLP Findings, 2022 (UC Davis). [Paper]
    • CVLNM: "Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning", IJCV, 2022 (Southeast University, China). [Paper][PyTorch]
    • ViNTER: "ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer", arXiv, 2022 (The University of Tokyo). [Paper]
    • VaT: "Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning", arXiv, 2022 (Tongji University). [Paper]
    • SCST-GEG: "Distinctive Image Captioning via CLIP Guided Group Optimization", arXiv, 2022 (McGill University). [Paper]
    • ?: "Vision Transformer Based Model for Describing a Set of Images as a Story", arXiv, 2022 (The University of Western Australia). [Paper]
    • CLM: "Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment", arXiv, 2022 (CAS). [Paper]
    • PromptCap: "PromptCap: Prompt-Guided Task-Aware Image Captioning", arXiv, 2022 (UW). [Paper]
    • PTSN: "Progressive Tree-Structured Prototype Network for End-to-End Image Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
    • DDCap: "Exploring Discrete Diffusion Models for Image Captioning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • SCD-Net: "Semantic-Conditional Diffusion Networks for Image Captioning", arXiv, 2022 (JD). [Paper][PyTorch]
    • ARIC: "Aesthetically Relevant Image Captioning", AAAI, 2023 (Shenzhen University). [Paper][Code (in construction)]
    • UAIC: "Uncertainty-Aware Image Captioning", AAAI, 2023 (Meituan). [Paper]
    • LiMBeR: "Linearly Mapping from Image to Text Space", ICLR, 2023 (Brown University). [Paper]
    • Re-ViLM: "Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning", arXiv, 2023 (NVIDIA). [Paper]
  • Video:
    • Masked Transformers: "End-to-End Dense Video Captioning with Masked Transformer", CVPR, 2018 (UMich + Salesforce). [Paper][PyTorch]
    • BMT: "A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer", BMVC, 2020 (Tampere University, Finland). [Paper][PyTorch][Website]
    • ?: "Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers", Interspeech, 2021 (MERL). [Paper]
    • MV-GPT: "End-to-end Generative Pretraining for Multimodal Video Captioning", CVPR, 2022 (Google). [Paper]
    • VGCL: "Video-Guided Curriculum Learning for Spoken Video Grounding", ACMMM, 2022 (Zhejiang University). [Paper][PyTorch]
    • UVC-VI: "Aligning Source Visual and Target Language Domains for Unpaired Video Captioning", TPAMI, 2022 (Peking University). [Paper]
    • D2: "Dual-Level Decoupled Transformer for Video Captioning", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
    • VASTA: "Diverse Video Captioning by Adaptive Spatio-temporal Attention", arXiv, 2022 (University of Tubingen, Germany). [Paper]
    • VCRN: "Visual Commonsense-aware Representation Network for Video Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
    • RSFD: "Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning", arXiv, 2022 (Wuhan University of Technology). [Paper][Code (in construction)]
    • VLTinT: "VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning", AAAI, 2023 (University of Arkansas). [Paper]

[Back to Overview]

Visual Question Answering

  • General:
    • MCAN: "Deep Modular Co-Attention Networks for Visual Question Answering", CVPR, 2019 (Hangzhou Dianzi University). [Paper][PyTorch]
    • M4C: "Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA", CVPR, 2020 (Facebook). [Paper]
    • SA-M4C: "Spatially Aware Multimodal Transformers for TextVQA", ECCV, 2020 (Georgia Tech). [Paper][PyTorch][Website]
    • ConClaT: "Contrast and Classify: Training Robust VQA Models", ICCV, 2021 (Georgia Tech). [Paper]
    • TRAR: "TRAR: Routing the Attention Spans in Transformer for Visual Question Answering", ICCV, 2021 (Xiamen University). [Paper]
    • UniQer: "Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue", ICCV, 2021 (Keio). [Paper]
    • TxT: "TxT: Crossmodal End-to-End Learning with Transformers", GCPR, 2021 (TU Darmstadt). [Paper]
    • ProTo: "ProTo: Program-Guided Transformer for Program-Guided Tasks", NeurIPS, 2021 (Georgia Tech). [Paper]
    • VisQA: "VisQA: X-raying Vision and Language Reasoning in Transformers", arXiv, 2021 (INSA-Lyon). [Paper][PyTorch]
    • Block-Skim: "Block-Skim: Efficient Question Answering for Transformer", AAAI, 2022 (Shanghai Jiao Tong). [Paper]
    • RelViT: "RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning", ICLR, 2022 (NVIDIA). [Paper][PyTorch]
    • Hypergraph-Transformer: "Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering", ACL, 2022 (SNU). [Paper][Code (in construction)]
    • X-Trans2Cap: "X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning", CVPR, 2022 (CUHK). [Paper]
    • UTC: "UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog", CVPR, 2022 (Fudan). [Paper]
    • LaTr: "LaTr: Layout-Aware Transformer for Scene-Text VQA", CVPR, 2022 (Amazon). [Paper]
    • QAA: "Query and Attention Augmentation for Knowledge-Based Explainable Reasoning", CVPR, 2022 (University of Minnesota). [Paper][PyTorch]
    • WebQA: "WebQA: Multihop and Multimodal QA", CVPR, 2022 (CMU + Microsoft). [Paper][PyTorch][Website]
    • ?: "Efficient Adaptive Image-Language Learning for Visual Question Answering", CVPRW, 2022 (Google). [Paper]
    • cViL: "cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation", ICPR, 2022 (IIIT, Hyderabad). [Paper]
    • Distinguishing-VQA: "Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances", COLING, 2022 (Nankai University). [Paper][Code (in construction)]
    • ?: "Weakly Supervised Grounding for VQA in Vision-Language Transformers", ECCV, 2022 (UCF). [Paper][PyTorch (in construction)]
    • MUST-VQA: "MUST-VQA: MUltilingual Scene-text VQA", ECCVW, 2022 (UAB, Spain). [Paper]
    • ?: "Training Vision-Language Models with Less Bimodal Supervision", Automated Knowledge Base Construction (AKBC), 2022 (Tel Aviv). [Paper]
    • REVIVE: "REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering", NeurIPS, 2022 (Microsoft). [Paper]
    • ScienceQA: "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering", NeurIPS, 2022 (AI2). [Paper][PyTorch][Website]
    • FrozenBiLM: "Zero-Shot Video Question Answering via Frozen Bidirectional Language Models", NeurIPS, 2022 (INRIA). [Paper][PyTorch]
    • MuRAG: "MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text", EMNLP, 2022 (Google). [Paper]
    • MMBS: "Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • EnFoRe: "Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering", EMNLP, 2022 (UT Austin). [Paper]
    • CRIPP-VQA: "CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering", EMNLP, 2022 (Arizona State University). [Paper][PyTorch][Website]
    • PnP-VQA: "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", EMNLP Findings, 2022 (Salesforce). [Paper]
    • TMN: "Transformer Module Networks for Systematic Generalization in Visual Question Answering", arXiv, 2022 (Fujitsu). [Paper]
    • ?: "On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering", arXiv, 2022 (Birla Institute of Technology Mesra, India). [Paper]
    • DST: "Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
    • PAVCR: "Attention Mechanism based Cognition-level Scene Understanding", arXiv, 2022 (Leibniz University of Hannover, Germany). [Paper]
    • TAG: "TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation", arXiv, 2022 (Maryland + Salesforce). [Paper][PyTorch]
    • UniCon: "UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering", arXiv, 2022 (University of Tokyo). [Paper]
    • CLOVE: "Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task", arXiv, 2022 (NUS). [Paper][Code (in construction)]
    • mVQA: "Towards Multi-Lingual Visual Question Answering", arXiv, 2022 (Google). [Paper]
    • CIB: "Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
    • ?: "Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering", arXiv, 2022 (CAS). [Paper]
    • VLR: "Visually Grounded VQA by Lattice-based Retrieval", arXiv, 2022 (University of Bremen, Germany). [Paper]
    • CMCL: "Cross-Modal Contrastive Learning for Robust Reasoning in VQA", arXiv, 2022 (University of Sydney). [Paper][PyTorch]
    • CL-CrossVQA: "CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering", arXiv, 2022 (LMU Munich). [Paper]
    • DANCE: "Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles", arXiv, 2022 (Microsoft). [Paper][PyTorch (in construction)][Website]
    • OFA-X: "Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations", arXiv, 2022 (University of Hamburg, Germany). [Paper][Code (in construction)]
    • VLC-BERT: "VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge", WACV, 2023 (UBC, Canada). [Paper][PyTorch]
  • Video:
    • ?: "Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering", arXiv, 2021 (Seoul National University). [Paper]
    • TPT: "Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering", arXiv, 2021 (CAS). [Paper]
    • SwinBERT: "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • WildQA: "WildQA: In-the-Wild Video Question Answering", International Conference on Computational Linguistics (COLING), 2022 (UMich). [Paper][Website]
    • VGT: "Video Graph Transformer for Video Question Answering", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
    • ?: "Video Question Answering with Iterative Video-Text Co-Tokenization", ECCV, 2022 (Google). [Paper][Website (in construction)]
    • DeST: "Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling", BMVC, 2022 (NTU). [Paper][PyTorch]
    • ViteVQA: "Towards Video Text Visual Question Answering: Benchmark and Baseline", NeurIPS, 2022 (ByteDance). [Paper][GitHub]
    • WSQG: "Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering", arXiv, 2022 (Zhejiang University). [Paper]
    • LocAns: "Locate before Answering: Answer Guided Question Localization for Video Question Answering", arXiv, 2022 (Fudan University). [Paper]
    • NewsVideoQA: "Watching the News: Towards VideoQA Models that can Read", arXiv, 2022 (IIIT Hyderabad, India). [Paper]

[Back to Overview]

Visual Grounding

  • General:
    • TransRefer3D: "TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding", ACMMM, 2021 (Beihang University). [Paper]
    • ?: "Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers", EMNLP, 2021 (University of Trento). [Paper]
    • MITVG: "Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation", ACL Findings, 2021 (Tencent). [Paper]
    • TransVG: "TransVG: End-to-End Visual Grounding with Transformers", ICCV, 2021 (USTC). [Paper]
    • GSRTR: "Grounded Situation Recognition with Transformers", BMVC, 2021 (POSTECH). [Paper][PyTorch]
    • Referring-Transformer: "Referring Transformer: A One-step Approach to Multi-task Visual Grounding", NeurIPS, 2021 (UBC). [Paper]
    • VGTR: "Visual Grounding with Transformers", arXiv, 2021 (Beihang University). [Paper]
    • UNICORN: "Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling", arXiv, 2021 (Microsoft). [Paper]
    • Word2Pix: "Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding", arXiv, 2021 (A*STAR). [Paper]
    • CoFormer: "Collaborative Transformers for Grounded Situation Recognition", CVPR, 2022 (POSTECH). [Paper][PyTorch]
    • MVT: "Multi-View Transformer for 3D Visual Grounding", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • GLIP: "Grounded Language-Image Pre-training", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • M-DGT: "Multi-Modal Dynamic Graph Transformer for Visual Grounding", CVPR, 2022 (University of Toronto). [Paper][PyTorch]
    • QRNet: "Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
    • SiRi: "SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding", ECCV, 2022 (JD). [Paper][PyTorch]
    • UniTAB: "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling", ECCV, 2022 (Microsoft). [Paper]
    • TAP: "Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers", ECCV, 2022 (Adobe). [Paper][GitHub][Website]
    • YORO: "YORO - Lightweight End to End Visual Grounding", ECCVW, 2022 (Amazon). [Paper]
    • GLIPv2: "GLIPv2: Unifying Localization and Vision-Language Understanding", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
    • ?: "Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?", EMNLP, 2022 (Aix-Marseille University, France). [Paper]
    • SeqTR: "SeqTR: A Simple yet Universal Network for Visual Grounding", arXiv, 2022 (Xiamen University). [Paper][Code (in construction)]
    • TransVG++: "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer", arXiv, 2022 (USTC). [Paper]
    • HLGT: "Hierarchical Local-Global Transformer for Temporal Sentence Grounding", arXiv, 2022 (Huazhong University of Science and Technology). [Paper]
    • Dynamic-MDETR: "Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding", arXiv, 2022 (Nanjing University). [Paper]
    • ClipCrop: "ClipCrop: Conditioned Cropping Driven by Vision-Language Model", arXiv, 2022 (The University of Tokyo). [Paper]
    • VL-MPAG-Net: "Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing", WACV, 2023 (Indian Institute of Science). [Paper][PyTorch][Website]
    • CLEVER: "Visually Grounded Commonsense Knowledge Acquisition", AAAI, 2023 (Tsinghua University). [Paper][PyTorch]
    • ?: "Learning to Jointly Share and Prune Weights for Grounding Based Vision and Language Models", ICLR, 2023 (Samsung). [Paper]
  • Video:
    • Multi-Stage-Transformer: "Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos", CVPR, 2021 (University of Electronic Science and Technology of China). [Paper]
    • GTR: "On Pursuit of Designing Multi-modal Transformer for Video Grounding", EMNLP, 2021 (Peking). [Paper]
    • STVGBert: "STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding", ICCV, 2021 (Tencent). [Paper]
    • DRFT: "End-to-end Multi-modal Video Temporal Grounding", NeurIPS, 2021 (UC Merced). [Paper]
    • TubeDETR: "TubeDETR: Spatio-Temporal Video Grounding with Transformers", CVPR, 2022 (INRIA). [Paper][Website]
    • STVGFormer: "STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding", ACMMMW, 2022 (Sun Yat-sen University). [Paper]
    • STCAT: "Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
    • VideoWhisperer: "Grounded Video Situation Recognition", NeurIPS, 2022 (IIIT Hyderabad, India). [Paper][Website]
    • VidGTR: "Explore and Match: End-to-End Video Grounding with Transformer", arXiv, 2022 (KAIST). [Paper]
    • ?: "Language-free Training for Zero-shot Video Grounding", WACV, 2023 (Yonsei University). [Paper]
  • 3D:
    • ViL3DRel: "Language Conditioned Spatial Relation Reasoning for 3D Object Grounding", NeurIPS, 2022 (INRIA). [Paper][Website]
    • LAR: "Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding", NeurIPS, 2022 (KAUST). [Paper][Website]
    • 3D-CG: "3D Concept Grounding on Neural Fields", NeurIPS, 2022 (MIT). [Paper][Website]
    • UniT3D: "UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding", arXiv, 2022 (TUM). [Paper]

[Back to Overview]

Multi-Modal Representation Learning

  • General:
    • LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers", EMNLP, 2019 (UNC). [Paper][PyTorch]
    • ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", NeurIPS, 2019 (Georgia Tech). [Paper][PyTorch]
    • Unified-VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA", AAAI, 2020 (UMich + Microsoft). [Paper][PyTorch]
    • UNITER: "UNITER: UNiversal Image-TExt Representation Learning", ECCV, 2020 (Microsoft). [Paper][PyTorch]
    • VinVL: "VinVL: Revisiting Visual Representations in Vision-Language Models", CVPR, 2021 (Microsoft). [Paper][Code]
    • CATT: "Causal Attention for Vision-Language Tasks", CVPR, 2021 (NTU Singapore). [Paper][PyTorch]
    • CLIP: "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 (OpenAI). [Paper][PyTorch]
    • ViLT: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision", ICML, 2021 (Kakao). [Paper][PyTorch]
    • MERLOT: "MERLOT: Multimodal Neural Script Knowledge Models", NeurIPS, 2021 (UW + AI2). [Paper][Tensorflow][Website]
    • SVO-Probes: "Probing Image-Language Transformers for Verb Understanding", arXiv, 2021 (DeepMind). [Paper]
    • CLIP-ViL: "How Much Can CLIP Benefit Vision-and-Language Tasks?", arXiv, 2021 (Berkeley + UCLA). [Paper][PyTorch]
    • Florence: "Florence: A New Foundation Model for Computer Vision", arXiv, 2021 (Microsoft). [Paper]
    • UFO: "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning", arXiv, 2021 (Microsoft). [Paper]
    • SimVLM: "SimVLM: Simple Visual Language Model Pretraining with Weak Supervision", ICLR, 2022 (Google). [Paper]
    • LiT: "LiT: Zero-Shot Transfer with Locked-image text Tuning", CVPR, 2022 (Google). [Paper]
    • UniCL: "Unified Contrastive Learning in Image-Text-Label Space", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • FLAVA: "FLAVA: A Foundational Language And Vision Alignment Model", CVPR, 2022 (Meta). [Paper][Pretrained Model][Code][Dataset][Website][Demos]
    • LEMON: "Scaling Up Vision-Language Pre-training for Image Captioning", CVPR, 2022 (Microsoft). [Paper]
    • METER: "An Empirical Study of Training End-to-End Vision-and-Language Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • Uni-Perceiver: "Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks", CVPR, 2022 (SenseTime). [Paper][PyTorch]
    • MERLOT-Reserve: "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound", CVPR, 2022 (UW + AI2). [Paper][JAX][Website]
    • CM-mix: "Pre-training image-language transformers for open-vocabulary tasks", CVPRW, 2022 (Google). [Paper]
    • VLMixer: "VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix", ICML, 2022 (Southern University of Science and Technology). [Paper][Code (in construction)]
    • VLUE: "VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models", ICML, 2022 (ByteDance). [Paper][Website][PyTorch]
    • X-VLM: "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts", ICML, 2022 (ByteDance). [Paper][PyTorch]
    • BLIP: "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", ICML, 2022 (Salesforce). [Paper][PyTorch]
    • OFA: "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework", ICML, 2022 (Alibaba). [Paper][PyTorch]
    • MS-CLIP: "Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • GRIT-VLP: "GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • SIMLA: "Single-Stream Multi-Level Alignment for Vision-Language Pretraining", ECCV, 2022 (Northeastern University). [Paper][PyTorch][Website]
    • Switch-BERT: "Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input", ECCV, 2022 (Ant Group). [Paper]
    • OmniVL: "OmniVL: One Foundation Model for Image-Language and Video-Language Tasks", NeurIPS, 2022 (Microsoft). [Paper]
    • UniCLIP: "UniCLIP: Unified Framework for Contrastive Language-Image Pre-training", NeurIPS, 2022 (LG). [Paper]
    • Uni-Perceiver-MoE: "Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs", NeurIPS, 2022 (SenseTime). [Paper][PyTorch]
    • CLOOB: "CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP", NeurIPS, 2022 (Johannes Kepler University, Austria). [Paper][PyTorch]
    • CyCLIP: "CyCLIP: Cyclic Contrastive Language-Image Pretraining", NeurIPS, 2022 (UCLA). [Paper]
    • ?: "Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP", NeurIPS, 2022 (UW). [Paper][Pytorch]
    • PyramidCLIP: "PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining", NeurIPS, 2022 (Tencent). [Paper]
    • ?: "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning", NeurIPS, 2022 (Stanford). [Paper][Website]
    • LIMoE: "Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts", NeurIPS, 2022 (Google). [Paper]
    • VLMo: "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts", NeurIPS, 2022 (Microsoft). [Paper][PyTorch (in construction)]
    • Knowledge-CLIP: "Contrastive Language-Image Pre-Training with Knowledge Graphs", NeurIPS, 2022 (Tsinghua). [Paper]
    • Flamingo: "Flamingo: a Visual Language Model for Few-Shot Learning", NeurIPS, 2022 (DeepMind). [Paper]
    • LOUPE: "Fine-Grained Semantically Aligned Vision-Language Pre-Training", NeurIPS, 2022 (Huawei). [Paper][Code (in construction)]
    • FIBER: "Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
    • UViM: "UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes", NeurIPS, 2022 (Google). [Paper]
    • LAION-5B: "LAION-5B: An open large-scale dataset for training next generation image-text models", NeurIPS (Datasets and Benchmarks), 2022 (LAION). [Paper][Website]
    • Wukong: "Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark", NeurIPS (Datasets and Benchmarks), 2022 (Huawei). [Paper][Website]
    • TaiSu: "TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training", NeurIPS (Datasets and Benchmarks), 2022 (CAS). [Paper][PyTorch]
    • WinoGAViL: "WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models", NeurIPS (Datasets and Benchmarks), 2022 (The Hebrew University of Jerusalem, Israel). [Paper][Website]
    • ELEVATER: "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models", NeurIPS (Datasets and Benchmarks), 2022 (Microsoft). [Paper][Website]
    • ?: "Robustness Analysis of Video-Language Models Against Visual and Language Perturbations", NeurIPS (Datasets and Benchmarks), 2022 (UCF). [Paper][Website]
    • FaD-VLP: "FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning", EMNLP, 2022 (Meta). [Paper]
    • Omnivore: "Omnivore: A Single Model for Many Visual Modalities", arXiv, 2022 (Meta). [Paper][PyTorch]
    • MultiMAE: "MultiMAE: Multi-modal Multi-task Masked Autoencoders", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
    • CoCa: "CoCa: Contrastive Captioners are Image-Text Foundation Models", arXiv, 2022 (Google). [Paper][PyTorch (lucidrains)]
    • VLC: "Training Vision-Language Transformers from Captions Alone", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • GIT: "GIT: A Generative Image-to-text Transformer for Vision and Language", arXiv, 2022 (Microsoft). [Paper]
    • CCLM: "Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training", arXiv, 2022 (ByteDance). [Paper]
    • VL-BEiT: "VL-BEiT: Generative Vision-Language Pretraining", arXiv, 2022 (Microsoft). [Paper]
    • MetaLM: "Language Models are General-Purpose Interfaces", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • Bridge-Tower: "Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • e-CLIP: "e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce", arXiv, 2022 (NAVER). [Paper]
    • LW-Transformer: "Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
    • UCM: "Self-Training Vision Language BERTs with a Unified Conditional Model", arXiv, 2022 (NTU, Singapore). [Paper]
    • Prefix-conditioning: "Prefix Conditioning Unifies Language and Label Supervision", arXiv, 2022 (Google). [Paper]
    • VLMAE: "VLMAE: Vision-Language Masked Autoencoder", arXiv, 2022 (Tencent). [Paper]
    • BEiT-3: "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment", arXiv, 2022 (Sorbonne University, France). [Paper][Code (in construction)]
    • DetailCLIP: "Injecting Image Details into CLIP's Feature Space", arXiv, 2022 (Megvii). [Paper]
    • ?: "Pre-training image-language transformers for open-vocabulary tasks", arXiv, 2022 (Google). [Paper]
    • ERNIE: "ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training", arXiv, 2022 (Baidu). [Paper][Paddle]
    • Pix2Struct: "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding", arXiv, 2022 (Google). [Paper]
    • VoLTA: "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment", arXiv, 2022 (JHU). [Paper]
    • MAP: "MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model", arXiv, 2022 (Tsinghua + Waseda). [Paper][PyTorch]
    • ?: "One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
    • MAPL: "MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting", arXiv, 2022 (Mila). [Paper]
    • EfficientVLM: "EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning", arXiv, 2022 (ByteDance). [Paper][PyTorch (in construction)]
    • xCLIP: "Non-Contrastive Learning Meets Language-Image Pre-Training", arXiv, 2022 (Microsoft). [Paper]
    • WFH: "Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision", WACV, 2023 (Aalto University, Finland). [Paper]
    • CN-CLIP: "Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese", arXiv, 2022 (Alibaba). [Paper]
    • Uni-Perceiver-v2: "Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks", arXiv, 2022 (Shanghai AI Lab). [Paper][PyTorch]
    • CLOSE: "I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data", arXiv, 2022 (AI2). [Paper]
    • SVLC: "Teaching Structured Vision&Language Concepts to Vision&Language Models", arXiv, 2022 (IBM). [Paper]
    • VATLM: "VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • X2-VLM: "X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks", arXiv, 2022 (ByteDance). [Paper][Code (in construction)]
    • SkillNet: "One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code", arXiv, 2022 (Tencent). [Paper]
    • SCL: "Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning", arXiv, 2022 (Tencent). [Paper]
    • EPIC: "Leveraging per Image-Token Consistency for Vision-Language Pre-training", arXiv, 2022 (ByteDance). [Paper]
    • FLIP: "Scaling Language-Image Pre-training via Masking", arXiv, 2022 (Meta). [Paper]
    • EVA: "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale", arXiv, 2022 (Beijing Academy of Artificial Intelligence (BAAI)). [Paper][PyTorch]
    • Compound-Tokens: "Compound Tokens: Channel Fusion for Vision-Language Representation Learning", arXiv, 2022 (Google). [Paper]
    • REVEAL: "REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory", arXiv, 2022 (Google). [Paper]
    • Perceiver-VL: "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention", WACV, 2023 (UNC). [Paper][PyTorch]
    • ?: "Unifying Vision-Language Representation Space with Single-tower Transformer", AAAI, 2023 (NAVER). [Paper]
    • PaLI: "PaLI: A Jointly-Scaled Multilingual Language-Image Model", ICLR, 2023 (Google). [Paper]
    • LilT: "Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning", ICLR, 2023 (Northeastern University). [Paper]
    • CLIPs: "Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning", ICLR, 2023 (Stanford). [Paper]
    • HiCLIP: "HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention", ICLR, 2023 (Rutgers University). [Paper]
    • DeCap: "DECAP: Decoding CLIP Latents for Zero-shot Captioning", ICLR, 2023 (Zhejiang University). [Paper]
    • MaskVLM: "Masked Vision and Language Modeling for Multi-modal Representation Learning", ICLR, 2023 (Amazon). [Paper]
    • DaVinci: "Write and Paint: Generative Vision-Language Models are Unified Modal Learners", ICLR, 2023 (ByteDance). [Paper][Code (in construction)]
  • Video:
    • COOT: "COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning", NeurIPS, 2020 (University of Freiburg). [Paper][PyTorch]
    • Parameter-Reduction: "Parameter Efficient Multimodal Transformers for Video Representation Learning", ICLR, 2021 (Seoul National University). [Paper]
    • ClipBERT: "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling", CVPR, 2021 (UNC + Microsoft). [Paper][PyTorch]
    • VLM: "VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding", ACL Findings, 2021 (Facebook). [Paper][PyTorch]
    • VideoCLIP: "VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding", EMNLP, 2021 (Facebook). [Paper][PyTorch]
    • VATT: "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text", NeurIPS, 2021 (Google). [Paper][Tensorflow]
    • VALUE: "VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation", NeurIPS (Datasets and Benchmarks), 2021 (Microsoft). [Paper][Website]
    • TAN: "Temporal Alignment Networks for Long-term Video", CVPR, 2022 (Oxford). [Paper][Code (in construction)][Website]
    • HD-VILA: "Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions", CVPR, 2022 (Microsoft). [Paper][GitHub]
    • ATP: "Revisiting the "Video" in Video-Language Understanding", CVPR, 2022 (Stanford). [Paper][Website]
    • ALPRO: "Align and Prompt: Video-and-Language Pre-training with Entity Prompts", CVPR, 2022 (Salesforce). [Paper][PyTorch]
    • CLOP: "CLOP: Video-and-Language Pre-Training with Knowledge Regularizations", ACMMM, 2022 (Baidu). [Paper]
    • VideoCC: "Learning Audio-Video Modalities from Image Captions", ECCV, 2022 (Google). [Paper][Website]
    • MUGEN: "MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration", ECCV, 2022 (Meta). [Paper][Website]
    • LocVTP: "LocVTP: Video-Text Pre-training for Temporal Localization", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • FineCo: "Contrastive Video-Language Learning with Fine-grained Frame Sampling", AACL, 2022 (ICL, UK). [Paper]
    • EMCL: "Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
    • LF-VILA: "Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning", NeurIPS, 2022 (Microsoft). [Paper][GitHub]
    • VATT-GR-CL: "Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization", NeurIPS, 2022 (Google). [Paper]
    • LGDN: "LGDN: Language-Guided Denoising Network for Video-Language Modeling", NeurIPS, 2022 (Renmin University of China). [Paper]
    • EgoVLP: "Egocentric Video-Language Pretraining", NeurIPS, 2022 (NUS). [Paper][PyTorch]
    • LiteVL: "LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling", EMNLP, 2022 (Peking University). [Paper]
    • Singularity: "Revealing Single Frame Bias for Video-and-Language Learning", arXiv, 2022 (UNC). [Paper]
    • All-in-One: "All in One: Exploring Unified Video-Language Pre-training", arXiv, 2022 (NUS). [Paper][PyTorch]
    • LAVENDER: "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • Clover: "Clover: Towards A Unified Video-Language Alignment and Fusion Model", arXiv, 2022 (ByteDance). [Paper][PyTorch (in construction)]
    • VIOLET: "VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • VIOLETv2: "An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling", arXiv, 2022 (Microsoft). [Paper]
    • SimVTP: "SimVTP: Simple Video Text Pre-training with Masked Autoencoders", arXiv, 2022 (Tencent). [Paper][PyTorch (in construction)]
    • VindLU: "VindLU: A Recipe for Effective Video-and-Language Pretraining", arXiv, 2022 (UNC). [Paper][PyTorch]
    • VideoCoCa: "Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners", arXiv, 2022 (Google). [Paper]
    • i-Code: "i-Code: An Integrative and Composable Multimodal Learning Framework", AAAI, 2023 (Microsoft). [Paper]
    • TempCLR: "TempCLR: Temporal Alignment Representation with Contrastive Learning", ICLR, 2023 (Columbia). [Paper]

[Back to Overview]

Multi-Modal Retrieval

  • General:
    • Fast-and-Slow: "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers", CVPR, 2021 (DeepMind). [Paper]
    • HTR: "Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning", CVPR, 2021 (Amazon). [Paper][PyTorch]
    • TERN: "Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features", CBMI, 2021 (National Research Council, Italy). [Paper]
    • VisualSparta: "VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search", arXiv, 2021 (CMU). [Paper]
    • CCR-CCS: "More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints", arXiv, 2021 (Rutgers + Amazon). [Paper]
    • MCProp: "Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching", ICLRW, 2022 (National Research Council, Italy). [Paper][PyTorch]
    • TASK-former: "A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch", ECCV, 2022 (Georgia Tech). [Paper][Website]
    • CODER: "CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval", ECCV, 2022 (Baidu). [Paper]
    • ?: "Most and Least Retrievable Images in Visual-Language Query Systems", ECCV, 2022 (Old Dominion University, Virginia). [Paper]
    • MACK: "MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching", NeurIPS, 2022 (CAS). [Paper]
    • MLA: "Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval", NeurIPS, 2022 (Renmin University of China). [Paper]
    • SpeechCLIP: "SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model", IEEE Workshop on Spoken Language Technology (SLT), 2022 (NTU). [Paper]
    • LoopITR: "LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval", arXiv, 2022 (UNC). [Paper]
    • TNLBT: "Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training", arXiv, 2022 (The University of Electro-Communications, Japan). [Paper]
    • HiVLP: "HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval", arXiv, 2022 (Huawei). [Paper]
    • ?: "Revising Image-Text Retrieval via Multi-Modal Entailment". arXiv, 2022 (Soochow University, China). [Paper]
    • TokenFlow: "TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval", arXiv, 2022 (Kuaishou). [Paper]
    • VLPCook: "Structured Vision-Language Pretraining for Computational Cooking", arXiv, 2022 (Sorbonne University, France). [Paper]
    • UniVL-DR: "Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval", ICLR, 2023 (Northeastern University, China). [Paper]
    • Pic2Word: "Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval", arXiv, 2023 (Google). [Paper]
    • STAIR: "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens", arXiv, 2023 (Apple). [Paper]
  • Video:
    • MMT: "Multi-modal Transformer for Video Retrieval", ECCV, 2020 (INRIA + Google). [Paper][Website]
    • AYCE: "All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers", CVPRW, 2021 (University of Modena and Reggio Emilia). [Paper][PyTorch]
    • HiT: "HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval", ICCV, 2021 (Kuaishou). [Paper]
    • Frozen: "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval", ICCV, 2021 (Oxford). [Paper][PyTorch][Website][Dataset]
    • CLIP4Clip: "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval", arXiv, 2021 (Microsoft). [Paper][PyTorch]
    • UMT: "UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
    • MMFT: "Everything at Once - Multi-modal Fusion Transformer for Video Retrieval", CVPR, 2022 (Goethe University Frankfurt, Germany). [Paper]
    • X-Pool: "X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval", CVPR, 2022 (Layer 6 AI, Toronto). [Paper][PyTorch][Website]
    • MVPt: "It's Time for Artistic Correspondence in Music and Video", CVPR, 2022 (Adobe). [Paper][Website]
    • OA-Trans: "Object-aware Video-language Pre-training for Retrieval", CVPR, 2022 (NUS). [Paper][PyTorch]
    • BridgeFormer: "Bridging Video-text Retrieval with Multiple Choice Questions", CVPR, 2022 (HKU). [Paper][PyTorch][Website]
    • CenterCLIP: "CenterCLIP: Token Clustering for Efficient Text-Video Retrieval", SIGIR, 2022 (Zhejiang University). [Paper]
    • X-CLIP: "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval", ACMMM, 2022 (Alibaba). [Paper]
    • HiSE: "Boosting Video-Text Retrieval with Explicit High-Level Semantics", ACMMM, 2022 (Baidu). [Paper]
    • TS2-Net: "TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval", ECCV, 2022 (Tencent). [Paper][PyTorch]
    • LAFF: "Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval", ECCV, 2022 (Renmin University of China). [Paper]
    • ECLIPSE: "ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound", ECCV, 2022 (UNC). [Paper][PyTorch][Website]
    • MILES: "MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval", ECCV, 2022 (HKU). [Paper][PyTorch]
    • VTC: "VTC: Improving Video-Text Retrieval with User Comments", ECCV, 2022 (Unitary, UK). [Paper][PyTorch][Website]
    • LINAS: "Learning Linguistic Association towards Efficient Text-Video Retrieval", ECCV, 2022 (CAS). [Paper][PyTorch]
    • ?: "A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge", ECCVW, 2022 (UW-Madison). [Paper]
    • ?: "Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval", NeurIPS, 2022 (Sun Yat-sen University). [Paper]
    • ConTra: "ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval", ACCV, 2022 (University of Bristol, UK). [Paper]
    • RaP: "RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • MDMMT-2: "MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization", arXiv, 2022 (Huawei). [Paper]
    • M2HF: "M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval", arXiv, 2022 (Tencent). [Paper]
    • FIRE: "Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks", arXiv, 2022 (Meta). [Paper][PyTorch]
    • Cross-Modal-Adapter: "Cross-Modal Adapter for Text-Video Retrieval", arXiv, 2022 (Tsinghua University). [Paper][PyTorch (in construction)]
    • VoP: "VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval", arXiv, 2022 (Alibaba). [Paper][Code (in construction)]
    • MAC: "Masked Contrastive Pre-Training for Efficient Video-Text Retrieval", arXiv, 2022 (Alibaba). [Paper]
    • CLIP-ViP: "CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment", ICLR, 2023 (Microsoft). [Paper][Code (in construction)]

[Back to Overview]

Multi-Modal Generation

  • General:
    • AttnGAN: "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks", CVPR, 2018 (Microsoft). [Paper][PyTorch]
    • ControlGAN: "Controllable Text-to-Image Generation", NeurIPS, 2019 (Oxford). [Paper][PyTorch]
    • DALL-E: "Zero-Shot Text-to-Image Generation", ICML, 2021 (OpenAI). [Paper][PyTorch][PyTorch (lucidrains)]
    • CogView: "CogView: Mastering Text-to-Image Generation via Transformers", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
    • Layout-VQGAN: "Text-to-Image Synthesis Based on Object-Guided Joint-Decoding Transformer", CVPR, 2022 (CAS). [Paper]
    • Lafite: "Towards Language-Free Training for Text-to-Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • AvatarCLIP: "AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars", SIGGRAPH, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • StoryDALL-E: "StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation", ECCV, 2022 (UNC). [Paper][PyTorch]
    • Make-A-Scene: "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors", ECCV, 2022 (Meta). [Paper][Video]
    • TCTIG: "Trace Controlled Text to Image Generation", ECCV, 2022 (Beihang University). [Paper]
    • CogView2: "CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch]
    • CLIPDraw: "CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders", NeurIPS, 2022 (Cross Compass, Japan). [Paper][PyTorch][Blog]
    • Imagen: "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", NeurIPS, 2022 (Google). [Paper][Website]
    • ?: "Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark", NeurIPSW, 2022 (Boston + MIT + Columbia). [Paper]
    • DALL-Eval: "DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers", arXiv, 2022 (UNC). [Paper][PyTorch]
    • DALL-E-2: "Hierarchical Text-Conditional Image Generation with CLIP Latents", arXiv, 2022 (OpenAI). [Paper][Website]
    • ?: "A very preliminary analysis of DALL-E 2", arXiv, 2022 (NYU). [Paper]
    • GLIDE: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models", arXiv, 2022 (OpenAI). [Paper][PyTorch]
    • ?: "Discovering the Hidden Vocabulary of DALLE-2", arXiv, 2022 (UT Austin). [Paper]
    • Parti: "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", arXiv, 2022 (Google). [Paper][GitHub][Website]
    • Textual-Inversion: "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion", arXiv, 2022 (NVIDIA). [Paper][Website]
    • VLMGAN: "Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks", arXiv, 2022 (Fudan University). [Paper]
    • PDM: "Progressive Denoising Model for Fine-Grained Text-to-Image Generation", arXiv, 2022 (Meituan). [Paper]
    • FS-VQG: "Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets", arXiv, 2022 (IIT Kharagpur). [Paper]
    • Swinv2-Imagen: "Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation", arXiv, 2022 (Auckland University of Technology). [Paper]
    • UniTune: "UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image", arXiv, 2022 (Google). [Paper]
    • VSD: "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation", arXiv, 2022 (Tianjin University). [Paper][Code (in construction)]
    • Lafite2: "Lafite2: Few-shot Text-to-Image Generation", arXiv, 2022 (SUNY, Buffalo). [Paper]
    • eDiffi: "eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers", arXiv, 2022 (NVIDIA). [Paper][Website]
    • VD: "Versatile Diffusion: Text, Images and Variations All in One Diffusion Model", arXiv, 2022 (Oregon). [Paper][PyTorch]
    • SpaText: "SpaText: Spatio-Textual Representation for Controllable Image Generation", arXiv, 2022 (Meta). [Paper][Website]
    • Story-LDM: "Make-A-Story: Visual Memory Conditioned Consistent Story Generation", arXiv, 2022 (UBC + Snap). [Paper]
    • ReCo: "ReCo: Region-Controlled Text-to-Image Generation", arXiv, 2022 (Microsoft). [Paper]
    • Shifted-Diffusion: "Shifted Diffusion for Text-to-image Generation", arXiv, 2022 (ByteDance). [Paper]
    • RA-CM3: "Retrieval-Augmented Multimodal Language Modeling", arXiv, 2022 (Meta). [Paper]
    • Structure-Diffusion: "Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis", arXiv, 2022 (UCSB + UC Santa Cruz). [Paper][PyTorch][Website]
    • Re-Imagen: "Re-Imagen: Retrieval-Augmented Text-to-Image Generator", ICLR, 2023 (Google). [Paper]
    • Prompt-to-Prompt: "Prompt-to-Prompt Image Editing with Cross Attention Control", ICLR, 2023 (Google). [Paper][PyTorch][Website]
    • UniD3: "Unified Discrete Diffusion for Simultaneous Vision-Language Generation", ICLR, 2023 (NTU, Singapore). [Paper]
  • Video:
    • Make-A-Video: "Make-A-Video: Text-to-Video Generation without Text-Video Data", arXiv, 2022 (Meta). [Paper]
    • Imagen-Video: "Imagen Video: High Definition Video Generation with Diffusion Models", arXiv, 2022 (Google). [Paper][Website]
    • Phenaki: "Phenaki: Variable Length Video Generation From Open Domain Textual Description", arXiv, 2022 (Google). [Paper][PyTorch (LAION-AI, in construction)][Website]
    • ?: "Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization", arXiv, 2022 (CMU). [Paper][PyTorch][Website]
    • MagicVideo: "MagicVideo: Efficient Video Generation With Latent Diffusion Models", arXiv, 2022 (ByteDance). [Paper][Website]
    • MMVG: "Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation", arXiv, 2022 (Meta). [Paper]
    • CogVideo: "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers", ICLR, 2023 (Tsinghua University). [Paper][GitHub (in construction)]

[Back to Overview]

Visual Document Understanding

  • LayoutLMv2: "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding", ACL, 2021 (Microsoft). [Paper][PyTorch]
  • DocFormer: "DocFormer: End-to-End Transformer for Document Understanding", ICCV, 2021 (Amazon). [Paper]
  • LayoutXLM: "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • TableFormer: "TableFormer: Table Structure Understanding with Transformers", CVPR, 2022 (IBM). [Paper]
  • TSRFormer: "TSRFormer: Table Structure Recognition with Transformers", ACMMM, 2022 (Microsoft). [Paper]
  • ERNIE-mmLayout: "ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding", ACMMM, 2022 (Baidu). [Paper]
  • Donut: "Donut: Document Understanding Transformer without OCR", ECCV, 2022 (NAVER). [Paper][PyTorch]
  • I2DFormer: "I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification", NeurIPS, 2022 (ETHZ). [Paper]
  • MGDoc: "MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding", EMNLP, 2022 (Adobe). [Paper]
  • DocEnTr: "DocEnTr: An End-to-End Document Image Enhancement Transformer", arXiv, 2022 (UAB, Spain). [Paper][PyTorch]
  • DocSegTr: "DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer", arXiv, 2022 (UAB, Spain). [Paper]
  • DiT: "DiT: Self-supervised Pre-training for Document Image Transformer", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • LayoutLMv3: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MATrIX: "MATrIX - Modality-Aware Transformer for Information eXtraction", arXiv, 2022 (Amazon). [Paper]
  • VLCDoC: "VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification", arXiv, 2022 (La Rochelle University, France). [Paper]
  • Bi-VLDoc: "Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding", arXiv, 2022 (Alibaba). [Paper]
  • TRUST: "TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers", arXiv, 2022 (Baidu). [Paper]
  • UDOP: "Unifying Vision, Text, and Layout for Universal Document Processing", arXiv, 2022 (Microsoft). [Paper]
  • Hi-VT5: "Hierarchical multimodal transformers for Multi-Page DocVQA", arXiv, 2022 (UAB, Spain). [Paper]
  • OCR-VQGAN: "OCR-VQGAN: Taming Text-within-Image Generation", WACV, 2023 (UAB, Spain). [Paper]
  • PIXEL: "Language Modelling with Pixels", ICLR, 2023 (University of Copenhagen, Denmark). [Paper]
  • Spotlight: "Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus", ICLR, 2023 (Google). [Paper]
  • MaskDoc: "Masked Visual-Textual Prediction for Document Image Representation Pretraining", ICLR, 2023 (Baidu). [Paper]

[Back to Overview]

Other Multi-Modal Tasks

  • Prompt Learning/Tuning:
    • CLIP-Adapter: "CLIP-Adapter: Better Vision-Language Models with Feature Adapters", arXiv, 2021 (Shanghai AI Lab). [Paper][PyTorch]
    • CoCoOp: "Conditional Prompt Learning for Vision-Language Models", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
    • ProDA: "Prompt Distribution Learning", CVPR, 2022 (Huawei). [Paper]
    • VPT: "Visual Prompt Tuning", ECCV, 2022 (Cornell). [Paper][PyTorch]
    • PerVL: ""This is my unicorn, Fluffy": Personalizing frozen vision-language representations", ECCV, 2022 (NVIDIA). [Paper][PyTorch]
    • OrdinalCLIP: "OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
    • BeamCLIP: "Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching", NeurIPS, 2022 (LG). [Paper]
    • CoOp: "Learning to Prompt for Vision-Language Models", IJCV, 2022 (NTU, Singapore). [Paper][PyTorch]
    • LASP: "Language-Aware Soft Prompting for Vision & Language Foundation Models", arXiv, 2022 (Samsung). [Paper]
    • VPT: "Variational prompt tuning improves generalization of vision-language models", arXiv, 2022 (Samsung). [Paper]
    • MaPLe: "MaPLe: Multi-modal Prompt Learning", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
    • CAVPT: "Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
    • Visual-Prompting: "Exploring Visual Prompts for Adapting Large-Scale Models", arXiv, 2022 (MIT). [Paper][PyTorch][Website]
    • PGN: "Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers", arXiv, 2022 (University of Amsterdam). [Paper][PyTorch]
    • UPT: "Unified Vision and Language Prompt Learning", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
    • CPL: "CPL: Counterfactual Prompt Learning for Vision and Language Models", arXiv, 2022 (UC Santa Cruz). [Paper]
    • PTP: "Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models", arXiv, 2022 (Baidu). [Paper]
    • TaskRes: "Task Residual for Tuning Vision-Language Models", arXiv, 2022 (NUS). [Paper][Code (in construction)]
    • MVLPT: "Multitask Vision-Language Prompt Tuning", arXiv, 2022 (Berkeley). [Paper][PyTorch]
    • ILM-VP: "Understanding and Improving Visual Prompting: A Label-Mapping Perspective", arXiv, 2022 (Michigan State). [Paper][PyTorch (in construction)]
    • TaI-DP: "Texts as Images in Prompt Tuning for Multi-Label Image Recognition", arXiv, 2022 (Tomorrow Advancing Life (TAL)). [Paper][PyTorch]
    • ?: "Task Bias in Vision-Language Models", arXiv, 2022 (Columbia). [Paper]
    • DeFo: "Learning to Decompose Visual Features with Latent Textual Prompts", ICLR, 2023 (UIUC). [Paper]
    • PLOT: "Prompt Learning with Optimal Transport for Vision-Language Models", ICLR, 2023 (CMU). [Paper]
    • ?: "Visual Classification via Description from Large Language Models", ICLR, 2023 (Columbia). [Paper]
    • CSP: "Learning to Compose Soft Prompts for Compositional Zero-Shot Learning", ICLR, 2023 (Brown University). [Paper][PyTorch]
    • ZPE: "A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models", arXiv, 2022 (Google). [Paper]
  • Zero-Shot:
    • SMs: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 (Google). [Paper][GitHub][Website]
  • X-Shot:
    • VidIL: "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners", NeurIPS, 2022 (UIUC). [Paper][PyTorch]
    • ComCLIP: "ComCLIP: Training-Free Compositional Image and Text Matching", arXiv, 2022 (UC Santa Cruz). [Paper]
    • TCT: "Efficient Zero-shot Visual Search via Target and Context-aware Transformer", arXiv, 2022 (Baylor College of Medicine, TX). [Paper]
    • ?: "Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning", ICLR, 2023 (University of Amsterdam). [Paper]
  • Segmentation:
    • VLT: "Vision-Language Transformer and Query Generation for Referring Segmentation", ICCV, 2021 (NTU, Singapore). [Paper][Tensorflow]
    • LAVT: "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation", CVPR, 2022 (Oxford). [Paper]
    • ReSTR: "ReSTR: Convolution-free Referring Image Segmentation Using Transformers", CVPR, 2022 (POSTECH). [Paper][Website]
    • VLT: "VLT: Vision-Language Transformer and Query Generation for Referring Segmentation", TPAMI, 2022 (NTU, Singapore). [Paper]
  • Tracking:
    • ModaMixer: "Divert More Attention to Vision-Language Tracking", NeurIPS, 2022 (Beijing Jiaotong University). [Paper][PyTorch]
  • Analysis:
    • MM-Explainability: "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", ICCV, 2021 (Tel Aviv). [Paper][PyTorch]
    • ?: "Are Multimodal Transformers Robust to Missing Modality?", CVPR, 2022 (University of Delaware). [Paper]
    • VL-InterpreT: "VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers", CVPR (demo), 2022 (Intel). [Paper][Website][Video]
    • ?: "Understanding Attention for Vision-and-Language Tasks", International Conference on Computational Linguistics (COLING), 2022 (The University of Sydney). [Paper]
    • VL-CheckList: "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations", arXiv, 2022 (Zhejiang University). [Paper][Code (in construction)]
  • Speaker Localization:
    • ?: "The Right to Talk: An Audio-Visual Transformer Approach", ICCV, 2021 (University of Arkansas). [Paper]
  • Multi-task:
    • UniT: "Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
    • Pix2Seq: "A Unified Sequence Interface for Vision Tasks", NeurIPS, 2022 (Google). [Paper]
    • Unified-IO: "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks", ICLR, 2023 (AI2). [Paper][JAX][Website]
    • LAVIS: "LAVIS: A Library for Language-Vision Intelligence", arXiv, 2022 (Salesforce). [Paper][PyTorch]
  • Language-based Video Editing:
    • M3L: "Language-based Video Editing via Multi-Modal Multi-Level Transformer", CVPR, 2022 (UCSB). [Paper]
  • Video Summarization:
    • GPT2MVS: "GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization", ICMR, 2021 (BBC). [Paper]
    • QVHighlights: "QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries", NeurIPS, 2021 (UNC). [Paper][PyTorch]
    • HMT: "Hierarchical Multimodal Transformer to Summarize Videos", arXiv, 2021 (Xidian University). [Paper]
    • ?: "Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention", ACMMM, 2022 (Adobe). [Paper]
    • IV-Sum: "TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency", ECCV, 2022 (Google). [Paper][Website]
  • Robotics:
    • CRT: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IROS, 2021 (Keio University). [Paper]
    • TraSeTR: "TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery", ICRA, 2022 (CUHK). [Paper]
    • VLMbench: "VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation", NeurIPS (Datasets and Benchmarks), 2022 (UC Santa Cruz). [Paper][PyTorch][Website]
  • Multi-modal Fusion:
    • MICA: "Attention Is Not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion", ICCV, 2021 (Southwest Jiaotong University). [Paper]
    • IFT: "Image Fusion Transformer", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
    • PPT: "PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion", arXiv, 2021 (?). [Paper]
    • TransFuse: "TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning", arXiv, 2022 (Fudan University). [Paper]
    • SwinFuse: "SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images", arXiv, 2022 (Taiyuan University of Science and Technology). [Paper]
    • ?: "Array Camera Image Fusion using Physics-Aware Transformers", arXiv, 2022 (University of Arizona). [Paper]
  • Human Interaction:
    • Dyadformer: "Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions", ICCVW, 2021 (Universitat de Barcelona). [Paper]
  • Sign Language Translation:
    • LWTA: "Stochastic Transformer Networks with Linear Competing Units: Application to end-to-end SL Translation", ICCV, 2021 (Cyprus University of Technology). [Paper]
  • 3D:
    • 3DRefTransformer: "3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language", WACV, 2022 (KAUST). [Paper][Website]
    • EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning", arXiv, 2022 (Peking University). [Paper]
    • ConceptFusion: "ConceptFusion: Open-set Multimodal 3D Mapping", arXiv, 2023 (MIT). [Paper][Website]
  • Speech Recognition:
    • AV-HuBERT: "Robust Self-Supervised Audio-Visual Speech Recognition", arXiv, 2022 (Meta). [Paper][PyTorch]
    • ?: "Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition", arXiv, 2022 (Google). [Paper]
  • Emotion Recognition:
    • ?: "A Pre-trained Audio-Visual Transformer for Emotion Recognition", ICASSP, 2022 (USC). [Paper]
    • MDAN: "MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis", CVPR, 2022 (Tencent). [Paper]
  • Sound Separation:
    • VoViT: "VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer", ECCV, 2022 (Universitat Pompeu Fabra, Spain). [Paper][PyTorch][Website]
    • iQuery: "iQuery: Instruments as Queries for Audio-Visual Sound Separation", arXiv, 2022 (UCSD). [Paper][Code (in construction)]
  • Language-guided Video Segmentation:
    • Locater: "Local-Global Context Aware Transformer for Language-Guided Video Segmentation", arXiv, 2022 (Zhejiang). [Paper][PyTorch]
  • Audio-Visual:
    • AV-HuBERT: "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction", ICLR, 2022 (Meta). [Paper][PyTorch]
    • AVCA: "Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language", CVPR, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • TCaF: "Temporal and cross-modal attention for audio-visual zero-shot learning", ECCV, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • AVSBench: "Audio-Visual Segmentation", ECCV, 2022 (SenseTime). [Paper][PyTorch][Website]
    • AVA-Memory: "Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment", ECCV, 2022 (KAIST). [Paper]
    • TVLT: "TVLT: Textless Vision-Language Transformer", NeurIPS, 2022 (UNC). [Paper][PyTorch]
    • ANGIE: "Audio-Driven Co-Speech Gesture Video Generation", NeurIPS, 2022 (CUHK). [Paper][Website]
    • MGN: "Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing", NeurIPS, 2022 (CMU + UT Austin). [Paper][PyTorch]
    • FS-RIR: "Few-Shot Audio-Visual Learning of Environment Acoustics", NeurIPS, 2022 (UT Austin). [Paper][Website]
    • u-HuBERT: "u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality", NeurIPS, 2022 (Meta). [Paper]
    • PC-VAE: "Multimodal Transformer for Parallel Concatenated Variational Autoencoders", NeurIPSW, 2022 (USC). [Paper]
    • AV-CAT: "Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers", SIGGRAPH Asia, 2022 (Tokyo Institute of Technology + Baidu). [Paper][Website]
    • Audiovisual-MAE: "Audiovisual Masked Autoencoders", arXiv, 2022 (Google). [Paper]
    • MTD: "Multimodal Transformer Distillation for Audio-Visual Synchronization", arXiv, 2022 (NTU). [Paper]
    • LAVISH: "Vision Transformers are Parameter-Efficient Audio-Visual Learners", arXiv, 2022 (UNC). [Paper][PyTorch][Website]
    • AVE-CLIP: "AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization", WACV, 2023 (UT Austin). [Paper]
    • CLIPSep: "CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos", ICLR, 2023 (Sony). [Paper]
    • CAV-MAE: "Contrastive Audio-Visual Masked Autoencoder", ICLR, 2023 (MIT + IBM). [Paper]
  • Sound Object Localization:
    • TURN: "Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization", NeurIPS, 2022 (Zhejiang University). [Paper][PyTorch (in construction)]
  • Sentiment Analysis:
    • CubeMLP: "CubeMLP: A MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation", ACMMM, 2022 (Zhejiang University). [Paper]
    • MCMulT: "Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos", arXiv, 2022 (Tencent). [Paper]
  • Named Entity Recognition:
    • FMIT: "Flat Multi-modal Interaction Transformer for Named Entity Recognition", International Conference on Computational Linguistics (COLING), 2022 (South China University of Technology). [Paper]
  • Localization via Embodied Dialog:
    • LED-Bert: "Transformer-based Localization from Embodied Dialog with Large-scale Pre-training", arXiv, 2022 (Georgia Tech). [Paper]
  • Object Captioning:
    • GRiT: "GRiT: A Generative Region-to-text Transformer for Object Understanding", arXiv, 2022 (Microsoft). [Paper][PyTorch]

[Back to Overview]


Citation

If you find this repository useful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

References