Transformer-in-Vision

Recent Transformer-based CV and related works. Welcome to comment/contribute!

Kept up to date.

Resource

Survey:

  • (arXiv 2021.06) A Survey of Transformers, [Paper]

  • (arXiv 2021.06) Attention mechanisms and deep learning for machine vision: A survey of the state of the art, [Paper]

  • (arXiv 2021.06) Pre-Trained Models: Past, Present and Future, [Paper]

  • (arXiv 2021.05) Can Attention Enable MLPs To Catch Up With CNNs? [Paper]

  • (arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey, [Paper]

  • (arXiv 2021.01) Transformers in Vision: A Survey, [Paper]
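
All of the surveyed architectures build on the same scaled dot-product attention primitive; the following is a minimal NumPy sketch (single head, illustrative shapes), not any one paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scale by sqrt(d_k) so logit variance stays near 1.
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = attention(q, k, v)
assert out.shape == (4, 8)
assert np.allclose(w.sum(axis=-1), 1.0)  # attention rows are distributions
```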

Recent Papers

  • (arXiv 2021.07) Exceeding the Limits of Visual-Linguistic Multi-Task Learning, [Paper]

  • (arXiv 2021.07) UIBert: Learning Generic Multimodal Representations for UI Understanding, [Paper]

  • (arXiv 2021.07) Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection, [Paper]

  • (arXiv 2021.07) A Unified Efficient Pyramid Transformer for Semantic Segmentation, [Paper]

  • (arXiv 2021.07) PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion, [Paper]

  • (arXiv 2021.07) ReFormer: The Relational Transformer for Image Captioning, [Paper]

  • (arXiv 2021.07) Rethinking and Improving Relative Position Encoding for Vision Transformer, [Paper], [Code]
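
The relative-position-encoding line of work above modifies the attention logits with terms that depend on token offsets; one common variant adds a learned bias per relative offset. A hedged 1-D NumPy sketch (names and shapes are illustrative, not the paper's exact formulation):

```python
import numpy as np

seq_len, d = 6, 16
rng = np.random.default_rng(1)

# One learnable scalar bias per relative offset in [-(L-1), L-1].
bias_table = rng.normal(size=2 * seq_len - 1)

# rel[i, j] = j - i, shifted to index into bias_table.
idx = np.arange(seq_len)
rel = idx[None, :] - idx[:, None] + (seq_len - 1)

q = rng.normal(size=(seq_len, d))
k = rng.normal(size=(seq_len, d))
logits = q @ k.T / np.sqrt(d) + bias_table[rel]

assert logits.shape == (seq_len, seq_len)
# The bias depends only on the offset: same-diagonal entries share it.
assert np.isclose(bias_table[rel][0, 1], bias_table[rel][3, 4])
```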

  • (arXiv 2021.07) Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers, [Paper]

  • (arXiv 2021.07) PlaneTR: Structure-Guided Transformers for 3D Plane Recovery, [Paper], [Code]

  • (arXiv 2021.07) Is Object Detection Necessary for Human-Object Interaction Recognition? [Paper]

  • (arXiv 2021.07) Don’t Sweep your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers, [Paper]

  • (arXiv 2021.07) Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers, [Paper], [Code]

  • (arXiv 2021.07) Go Wider Instead of Deeper, [Paper]

  • (arXiv 2021.07) Contextual Transformer Networks for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.07) Mixed SIGNals: Sign Language Production via a Mixture of Motion Primitives, [Paper]

  • (arXiv 2021.07) Query2Label: A Simple Transformer Way to Multi-Label Classification, [Paper], [Code]

  • (arXiv 2021.07) EAN: Event Adaptive Network for Enhanced Action Recognition, [Paper], [Code]

  • (arXiv 2021.07) CycleMLP: A MLP-like Architecture for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.07) Generative Video Transformer: Can Objects be the Words? [Paper]

  • (arXiv 2021.07) QVHIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries, [Paper], [Code]

  • (arXiv 2021.07) PICASO: Permutation-Invariant Cascaded Attentional Set Operator, [Paper], [Code]

  • (arXiv 2021.07) RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition, [Paper]

  • (arXiv 2021.07) OODformer: Out-Of-Distribution Detection Transformer, [Paper], [Code]

  • (arXiv 2021.07) Image Fusion Transformer, [Paper], [Code]

  • (arXiv 2021.07) ResT: An Efficient Transformer for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.07) STAR: Sparse Transformer-based Action Recognition, [Paper], [Code]

  • (arXiv 2021.07) Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition, [Paper]

  • (arXiv 2021.07) How Much Can CLIP Benefit Vision-and-Language Tasks? [Paper]

  • (arXiv 2021.07) Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms, [Paper], [Code]

  • (arXiv 2021.07) Visual Parser: Representing Part-whole Hierarchies with Transformers, [Paper], [Code]

  • (arXiv 2021.07) Combiner: Full Attention Transformer with Sparse Computation Cost, [Paper]

  • (arXiv 2021.07) Per-Pixel Classification is Not All You Need for Semantic Segmentation, [Paper], [Project]

  • (arXiv 2021.07) Learning Multi-Scene Absolute Pose Regression with Transformers, [Paper]

  • (arXiv 2021.07) CMT: Convolutional Neural Networks Meet Vision Transformers, [Paper]

  • (arXiv 2021.07) HAT: Hierarchical Aggregation Transformers for Person Re-identification, [Paper], [Code]

  • (arXiv 2021.07) The Brownian Motion in the Transformer Model, [Paper]

  • (arXiv 2021.07) Local-to-Global Self-Attention in Vision Transformers, [Paper], [Code]

  • (arXiv 2021.07) Scenes and Surroundings: Scene Graph Generation using Relation Transformer, [Paper]

  • (arXiv 2021.07) ViTGAN: Training GANs with Vision Transformers, [Paper]

  • (arXiv 2021.07) Long-Short Temporal Contrastive Learning of Video Transformers, [Paper]

  • (arXiv 2021.07) PVTv2: Improved Baselines with Pyramid Vision Transformer, [Paper], [Code]

  • (arXiv 2021.07) Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers, [Paper], [Code]

  • (arXiv 2021.07) LanguageRefer: Spatial-Language Model for 3D Visual Grounding, [Paper]

  • (arXiv 2021.07) EEG-ConvTransformer for Single-Trial EEG-Based Visual Stimuli Classification, [Paper]

  • (arXiv 2021.07) Feature Fusion Vision Transformer for Fine-Grained Visual Categorization, [Paper]

  • (arXiv 2021.07) Long-Short Transformer: Efficient Transformers for Language and Vision, [Paper]

  • (arXiv 2021.07) TransformerFusion: Monocular RGB Scene Reconstruction using Transformers, [Paper]

  • (arXiv 2021.07) VIDLANKD: Improving Language Understanding via Video-Distilled Knowledge Transfer, [Paper], [Code]

  • (arXiv 2021.07) GLiT: Neural Architecture Search for Global and Local Image Transformer, [Paper]

  • (arXiv 2021.07) Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition, [Paper]

  • (arXiv 2021.07) Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World, [Paper]

  • (arXiv 2021.07) Long Short-Term Transformer for Online Action Detection, [Paper]

  • (arXiv 2021.07) Vision Xformers: Efficient Attention for Image Classification, [Paper]

  • (arXiv 2021.07) Test-Time Personalization with a Transformer for Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.07) What Makes for Hierarchical Vision Transformer? [Paper]

  • (arXiv 2021.07) Efficient Vision Transformers via Fine-Grained Manifold Distillation, [Paper]

  • (arXiv 2021.07) Visual Relationship Forecasting in Videos, [Paper]

  • (arXiv 2021.07) Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots, [Paper]

  • (arXiv 2021.07) Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions, [Paper]

  • (arXiv 2021.07) CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, [Paper], [Code]

  • (arXiv 2021.07) CLIP-It! Language-Guided Video Summarization, [Paper], [Code]

  • (arXiv 2021.07) AutoFormer: Searching Transformers for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.07) Focal Self-attention for Local-Global Interactions in Vision Transformers, [Paper]

  • (arXiv 2021.07) Global Filter Networks for Image Classification, [Paper], [Code]

  • (arXiv 2021.07) VideoLightFormer: Lightweight Action Recognition using Transformers, [Paper]

  • (arXiv 2021.07) OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation, [Paper]

  • (arXiv 2021.07) TransSC: Transformer-based Shape Completion for Grasp Evaluation, [Paper]

  • (arXiv 2021.07) Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition, [Paper]

  • (arXiv 2021.06) Associating Objects with Transformers for Video Object Segmentation, [Paper], [Code]

  • (arXiv 2021.06) Video Super-Resolution Transformer, [Paper], [Code]

  • (arXiv 2021.06) Thinking Like Transformers, [Paper]

  • (arXiv 2021.06) Kernel Identification Through Transformers, [Paper]

  • (arXiv 2021.06) XCiT: Cross-Covariance Image Transformers, [Paper]

  • (arXiv 2021.06) THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers, [Paper]

  • (arXiv 2021.06) Probing Image–Language Transformers for Verb Understanding, [Paper]

  • (arXiv 2021.06) How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, [Paper], [Code], [Model]

  • (arXiv 2021.06) End-to-end Temporal Action Detection with Transformer, [Paper], [Code]

  • (arXiv 2021.06) Efficient Self-supervised Vision Transformers for Representation Learning, [Paper]

  • (arXiv 2021.06) CLIP2Video: Mastering Video-Text Retrieval via Image CLIP, [Paper], [Code]

  • (arXiv 2021.06) Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers, [Paper], [Code]

  • (arXiv 2021.06) Transformed ROIs for Capturing Visual Transformations in Videos, [Paper]

  • (arXiv 2021.06) Transformer in Convolutional Neural Networks, [Paper], [Code]

  • (arXiv 2021.06) Video Instance Segmentation using Inter-Frame Communication Transformers, [Paper]

  • (arXiv 2021.06) Patch Slimming for Efficient Vision Transformers, [Paper]

  • (arXiv 2021.06) CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings, [Paper]

  • (arXiv 2021.06) RegionViT: Regional-to-Local Attention for Vision Transformers, [Paper]

  • (arXiv 2021.06) Motion Planning Transformers: One Model to Plan Them All, [Paper]

  • (arXiv 2021.06) Oriented Object Detection with Transformer, [Paper]

  • (arXiv 2021.06) Referring Transformer: A One-step Approach to Multi-task Visual Grounding, [Paper]

  • (arXiv 2021.06) Grounding inductive biases in natural images: invariance stems from variations in data, [Paper]

  • (arXiv 2021.06) CoAtNet: Marrying Convolution and Attention for All Data Sizes, [Paper]

  • (arXiv 2021.06) Scaling Vision Transformers, [Paper]

  • (arXiv 2021.06) Uformer: A General U-Shaped Transformer for Image Restoration, [Paper], [Code]

  • (arXiv 2021.06) Visual Transformer for Task-aware Active Learning, [Paper], [Code]

  • (arXiv 2021.06) Chasing Sparsity in Vision Transformers: An End-to-End Exploration, [Paper], [Code]

  • (arXiv 2021.06) DETReg: Unsupervised Pretraining with Region Priors for Object Detection, [Paper], [Code]

  • (arXiv 2021.06) MVT: Mask Vision Transformer for Facial Expression Recognition in the Wild, [Paper]

  • (arXiv 2021.06) Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight, [Paper]

  • (arXiv 2021.06) Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer, [Paper]

  • (arXiv 2021.06) MlTr: Multi-label Classification with Transformer, [Paper], [Code]

  • (arXiv 2021.06) Going Beyond Linear Transformers with Recurrent Fast Weight Programmers, [Paper], [Code]

  • (arXiv 2021.06) On Improving Adversarial Transferability of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.06) Fully Transformer Networks for Semantic Image Segmentation, [Paper]

  • (arXiv 2021.06) MST: Masked Self-Supervised Transformer for Visual Representation, [Paper]

  • (arXiv 2021.06) Space-time Mixing Attention for Video Transformer, [Paper]

  • (arXiv 2021.06) ViT-Inception-GAN for Image Colourising, [Paper]

  • (arXiv 2021.06) Hybrid Generative-Contrastive Representation Learning, [Paper], [Code]

  • (arXiv 2021.06) OadTR: Online Action Detection with Transformers, [Paper], [Code]

  • (arXiv 2021.06) VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning, [Paper], [Code]

  • (arXiv 2021.06) Delving Deep into the Generalization of Vision Transformers under Distribution Shifts, [Paper], [Code]

  • (arXiv 2021.06) Improved Transformer for High-Resolution GANs, [Paper]

  • (arXiv 2021.06) Towards Long-Form Video Understanding, [Paper], [Code]

  • (arXiv 2021.06) TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [Paper]

  • (arXiv 2021.06) More than Encoder: Introducing Transformer Decoder to Upsample, [Paper]

  • (arXiv 2021.06) A Picture May Be Worth a Hundred Words for Visual Question Answering, [Paper]

  • (arXiv 2021.06) Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training, [Paper]

  • (arXiv 2021.06) Shape registration in the time of transformers, [Paper]

  • (arXiv 2021.06) Vision Transformer Architecture Search, [Paper], [Code]

  • (arXiv 2021.06) Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue, [Paper]

  • (arXiv 2021.06) Multi-Exit Vision Transformer for Dynamic Inference, [Paper]

  • (arXiv 2021.06) Early Convolutions Help Transformers See Better, [Paper]

  • (arXiv 2021.06) Rethinking Token-Mixing MLP for MLP-based Vision Backbone, [Paper]

  • (arXiv 2021.06) Augmented Shortcuts for Vision Transformers, [Paper]

  • (arXiv 2021.06) CAT: Cross Attention in Vision Transformer, [Paper], [Code]

  • (arXiv 2021.06) Post-Training Quantization for Vision Transformer, [Paper]

  • (arXiv 2021.06) Attention Bottlenecks for Multimodal Fusion, [Paper]

  • (arXiv 2021.06) Improving the Efficiency of Transformers for Resource-Constrained Devices, [Paper]

  • (arXiv 2021.06) Multimodal Few-Shot Learning with Frozen Language Models, [Paper]

  • (arXiv 2021.06) Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation, [Paper]

  • (arXiv 2021.06) Exploring Vision Transformers for Fine-grained Classification, [Paper], [Code]

  • (arXiv 2021.06) S^2-MLP: Spatial-Shift MLP Architecture for Vision, [Paper]

  • (arXiv 2021.06) Styleformer: Transformer based Generative Adversarial Networks with Style Vector, [Paper], [Code]

  • (arXiv 2021.06) ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias, [Paper], [Code]

  • (arXiv 2021.06) Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, [Paper]

  • (arXiv 2021.06) Refiner: Refining Self-attention for Vision Transformers, [Paper], [Code]

  • (arXiv 2021.06) Person Re-Identification with a Locally Aware Transformer, [Paper]

  • (arXiv 2021.06) Efficient Training of Visual Transformers with Small-Size Datasets, [Paper]

  • (arXiv 2021.06) Glance-and-Gaze Vision Transformer, [Paper], [Code]

  • (arXiv 2021.06) Few-Shot Segmentation via Cycle-Consistent Transformer, [Paper]

  • (arXiv 2021.06) Semantic Correspondence with Transformers, [Paper], [Code]

  • (arXiv 2021.06) The Image Local Autoregressive Transformer, [Paper]

  • (arXiv 2021.06) MERLOT: Multimodal Neural Script Knowledge Models, [Paper], [Project]

  • (arXiv 2021.06) SOLQ: Segmenting Objects by Learning Queries, [Paper], [Code]

  • (arXiv 2021.06) Personalizing Pre-trained Models, [Paper], [Code]

  • (arXiv 2021.06) E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning, [Paper]

  • (arXiv 2021.06) VOLO: Vision Outlooker for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.06) Container: Context Aggregation Network, [Paper]

  • (arXiv 2021.06) Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers, [Paper]

  • (arXiv 2021.06) Video Swin Transformer, [Paper], [Code]

  • (arXiv 2021.06) IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers, [Paper], [Code]

  • (arXiv 2021.06) AudioCLIP: Extending CLIP to Image, Text and Audio, [Paper]

  • (arXiv 2021.06) Vision Permutator: A Permutable MLP-like Architecture for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.06) Co-advise: Cross Inductive Bias Distillation, [Paper]

  • (arXiv 2021.06) Team PyKale (xy9) Submission to the EPIC-Kitchens 2021 Unsupervised Domain Adaptation Challenge for Action Recognition, [Paper]

  • (arXiv 2021.06) P2T: Pyramid Pooling Transformer for Scene Understanding, [Paper], [Code]

  • (arXiv 2021.06) LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction, [Paper]

  • (arXiv 2021.06) Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding, [Paper]

  • (arXiv 2021.06) MODETR: Moving Object Detection with Transformers, [Paper]

  • (arXiv 2021.06) Multi-head or Single-head? An Empirical Comparison for Transformer Training, [Paper]

  • (arXiv 2021.06) Dynamic Head: Unifying Object Detection Heads with Attentions, [Paper], [Code]

  • (arXiv 2021.06) MLP-Mixer: An all-MLP Architecture for Vision, [Paper], [Code]
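
The MLP-Mixer entry above alternates a token-mixing MLP (across patches) with a channel-mixing MLP (across features), each wrapped in a layer norm and residual. A minimal shape-level sketch with random weights (ReLU stands in for the paper's GELU):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU here; GELU in the paper

def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    # Token mixing: transpose so the MLP runs over the patch axis.
    y = x + mlp(layer_norm(x).T, tok_w1, tok_w2).T
    # Channel mixing: MLP over the feature axis per token.
    return y + mlp(layer_norm(y), ch_w1, ch_w2)

rng = np.random.default_rng(0)
tokens, channels, hidden = 9, 32, 64
x = rng.normal(size=(tokens, channels))
out = mixer_block(
    x,
    rng.normal(size=(tokens, hidden)) * 0.02,
    rng.normal(size=(hidden, tokens)) * 0.02,
    rng.normal(size=(channels, hidden)) * 0.02,
    rng.normal(size=(hidden, channels)) * 0.02,
)
assert out.shape == x.shape
```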

  • (arXiv 2021.06) BEIT: BERT Pre-Training of Image Transformers, [Paper], [Code]

  • (arXiv 2021.06) Scaling Vision with Sparse Mixture of Experts, [Paper]

  • (arXiv 2021.06) Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition, [Paper]

  • (arXiv 2021.06) Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time, [Paper], [Code]

  • (arXiv 2021.06) DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, [Paper], [Code]

  • (arXiv 2021.06) SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation, [Paper]

  • (arXiv 2021.06) Anticipative Video Transformer, [Paper], [Project]

  • (arXiv 2021.06) When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, [Paper]

  • (arXiv 2021.06) StyTr^2: Unbiased Image Style Transfer with Transformers, [Paper]

  • (arXiv 2021.06) THG: Transformer with Hyperbolic Geometry, [Paper]

  • (arXiv 2021.06) You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection, [Paper], [Code]

  • (arXiv 2021.06) TransVOS: Video Object Segmentation with Transformers, [Paper]

  • (arXiv 2021.06) Reinforcement Learning as One Big Sequence Modeling Problem, [Paper], [Project]

  • (arXiv 2021.06) Less is More: Pay Less Attention in Vision Transformers, [Paper], [Code]

  • (arXiv 2021.06) SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2021.05) KVT: k-NN Attention for Boosting Vision Transformers, [Paper]

  • (arXiv 2021.05) Memory-Efficient Differentiable Transformer Architecture Search, [Paper]

  • (arXiv 2021.05) An Attention Free Transformer, [Paper]

  • (arXiv 2021.05) On the Bias Against Inductive Biases, [Paper]

  • (arXiv 2021.05) MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation, [Paper]

  • (arXiv 2021.05) Transformer-Based Source-Free Domain Adaptation, [Paper], [Code]

  • (arXiv 2021.05) FoveaTer: Foveated Transformer for Image Classification, [Paper]

  • (arXiv 2021.05) UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis, [Paper]

  • (arXiv 2021.05) Gaze Estimation using Transformer, [Paper], [Code]

  • (arXiv 2021.05) Transformer-Based Deep Image Matching for Generalizable Person Re-identification, [Paper], [Project]

  • (arXiv 2021.05) Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length, [Paper]

  • (arXiv 2021.05) Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model, [Paper]

  • (arXiv 2021.05) MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens, [Paper], [Code]

  • (arXiv 2021.05) Sequence Parallelism: Making 4D Parallelism Possible, [Paper]

  • (arXiv 2021.05) CogView: Mastering Text-to-Image Generation via Transformers, [Paper], [Code]

  • (arXiv 2021.05) TrTr: Visual Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.05) Conformer: Local Features Coupling Global Representations for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.05) Visual Grounding with Transformers, [Paper]

  • (arXiv 2021.05) Self-Supervised Learning with Swin Transformers, [Paper], [Code]

  • (arXiv 2021.05) Are Pre-trained Convolutions Better than Pre-trained Transformers? [Paper]

  • (arXiv 2021.05) MOTR: End-to-End Multiple-Object Tracking with TRansformer, [Paper], [Code]

  • (arXiv 2021.05) Attention for Image Registration (AiR): an unsupervised Transformer approach, [Paper], [Code]

  • (arXiv 2021.05) Exploring Explicit and Implicit Visual Relationships for Image Captioning, [Paper]

  • (arXiv 2021.05) Computer-Aided Design as Language, [Paper]

  • (arXiv 2021.05) FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction, [Paper], [Project]

  • (arXiv 2021.05) TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval, [Paper]

  • (arXiv 2021.05) High-Resolution Complex Scene Synthesis with Transformers, [Paper]

  • (arXiv 2021.05) Episodic Transformer for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.05) Towards Robust Vision Transformer, [Paper], [Code]

  • (arXiv 2021.05) Vision Transformers are Robust Learners, [Paper], [Code]

  • (arXiv 2021.05) ISTR: End-to-End Instance Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2021.05) SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition, [Paper]

  • (arXiv 2021.05) Rethinking Skip Connection with Layer Normalization in Transformers and ResNets, [Paper]

  • (arXiv 2021.05) IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture, [Paper]

  • (arXiv 2021.05) Parallel Attention Network with Sequence Matching for Video Grounding, [Paper], [Code]

  • (arXiv 2021.05) Relative Positional Encoding for Transformers with Linear Complexity, [Paper]

  • (arXiv 2021.05) VTNet: Visual Transformer Network for Object Goal Navigation, [Paper]

  • (arXiv 2021.05) DeepCAD: A Deep Generative Network for Computer-Aided Design Models, [Paper]

  • (arXiv 2021.05) Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead, [Paper]

  • (arXiv 2021.05) Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks, [Paper], [Code]

  • (arXiv 2021.05) Combining Transformer Generators with Convolutional Discriminators, [Paper]

  • (arXiv 2021.05) VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding, [Paper]

  • (arXiv 2021.05) Improving Generation and Evaluation of Visual Stories via Semantic Consistency, [Paper], [Code]

  • (arXiv 2021.05) BELT: Blockwise Missing Embedding Learning Transformer, [Paper]

  • (arXiv 2021.05) End-to-End Video Object Detection with Spatial-Temporal Transformers, [Paper], [Code]

  • (arXiv 2021.05) SAT: 2D Semantics Assisted Training for 3D Visual Grounding, [Paper]

  • (arXiv 2021.05) Aggregating Nested Transformers, [Paper]

  • (arXiv 2021.05) Intriguing Properties of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.05) Temporal Action Proposal Generation with Transformers, [Paper]

  • (arXiv 2021.05) Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation, [Paper], [Code]

  • (arXiv 2021.05) Perceptual Image Quality Assessment with Transformers, [Paper]

  • (arXiv 2021.05) Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, [Paper], [Code]

  • (arXiv 2021.05) Pay Attention to MLPs, [Paper]

  • (arXiv 2021.05) ResMLP: Feedforward networks for image classification with data-efficient training, [Paper]

  • (arXiv 2021.05) RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition, [Paper], [Code]

  • (arXiv 2021.05) Are Convolutional Neural Networks or Transformers more like human vision? [Paper]

  • (arXiv 2021.05) FNet: Mixing Tokens with Fourier Transforms, [Paper]
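
The FNet entry above replaces the self-attention sublayer with an unparameterized Fourier transform over the sequence and hidden dimensions, keeping only the real part. A minimal sketch of that token-mixing step (shapes are illustrative):

```python
import numpy as np

def fourier_mixing(x):
    # 2-D FFT over (sequence, hidden); the real part feeds the
    # following feed-forward sublayer, as described in FNet.
    return np.fft.fft2(x).real

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))  # (sequence, hidden)
mixed = fourier_mixing(x)
assert mixed.shape == x.shape
assert np.isrealobj(mixed)
```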

  • (arXiv 2021.05) Segmenter: Transformer for Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]

  • (arXiv 2021.04) HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction, [Paper]

  • (arXiv 2021.04) Chop Chop BERT: Visual Question Answering by Chopping VisualBERT’s Heads, [Paper]

  • (arXiv 2021.04) CoSformer: Detecting Co-Salient Object with Transformers, [Paper]

  • (arXiv 2021.04) CAT: Cross-Attention Transformer for One-Shot Object Detection, [Paper]

  • (arXiv 2021.04) Dual Transformer for Point Cloud Analysis, [Paper]

  • (arXiv 2021.04) Playing Lottery Tickets with Vision and Language, [Paper]

  • (arXiv 2021.04) M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers, [Paper]

  • (arXiv 2021.04) RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory, [Paper], [Code]

  • (arXiv 2021.04) MDETR: Modulated Detection for End-to-End Multi-Modal Understanding, [Paper], [Code]

  • (arXiv 2021.04) Rich Semantics Improve Few-shot Learning, [Paper], [Code]

  • (arXiv 2021.04) Effect of Vision-and-Language Extensions on Natural Language Understanding in Vision-and-Language Models, [Paper]

  • (arXiv 2021.04) Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet, [Paper], [Code]

  • (arXiv 2021.04) So-ViT: Mind Visual Tokens for Vision Transformer, [Paper]

  • (arXiv 2021.04) Multiscale Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection, [Paper]

  • (arXiv 2021.04) Transformer Transforms Salient Object Detection and Camouflaged Object Detection, [Paper]

  • (arXiv 2021.04) T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval, [Paper]

  • (arXiv 2021.04) VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization, [Paper]

  • (arXiv 2021.04) Multi-Modal Fusion Transformer for End-to-End Autonomous Driving, [Paper], [Code]

  • (arXiv 2021.04) TransVG: End-to-End Visual Grounding with Transformers, [Paper]

  • (arXiv 2021.04) Visual Transformer Pruning, [Paper]

  • (arXiv 2021.04) Higher Order Recurrent Space-Time Transformer, [Paper], [Code]

  • (arXiv 2021.04) CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, [Paper], [Code]

  • (arXiv 2021.04) Lessons on Parameter Sharing across Layers in Transformers, [Paper]

  • (arXiv 2021.04) Disentangled Motif-aware Graph Learning for Phrase Grounding, [Paper]

  • (arXiv 2021.04) Co-Scale Conv-Attentional Image Transformers, [Paper], [Code]

  • (arXiv 2021.04) Cloth Interactive Transformer for Virtual Try-On, [Paper], [Code]

  • (arXiv 2021.04) LocalViT: Bringing Locality to Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, [Paper]

  • (arXiv 2021.04) Facial Attribute Transformers for Precise and Robust Makeup Transfer, [Paper]

  • (arXiv 2021.04) Emerging Properties in Self-Supervised Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) ConTNet: Why not use convolution and transformer at the same time? [Paper], [Code]

  • (arXiv 2021.04) Point Cloud Learning with Transformer, [Paper]

  • (arXiv 2021.04) Twins: Revisiting the Design of Spatial Attention in Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) Inpainting Transformer for Anomaly Detection, [Paper]

  • (arXiv 2021.04) Shot Contrastive Self-Supervised Learning for Scene Boundary Detection, [Paper]

  • (arXiv 2021.04) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [Paper]

  • (arXiv 2021.04) Visual Saliency Transformer, [Paper]

  • (arXiv 2021.04) Improve Vision Transformers Training by Suppressing Over-smoothing, [Paper], [Code]

  • (arXiv 2021.04) Visformer: The Vision-friendly Transformer, [Paper], [Code]

  • (arXiv 2021.04) TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking, [Paper]

  • (arXiv 2021.04) Mesh Graphormer, [Paper]

  • (arXiv 2021.04) TrajeVAE: Controllable Human Motion Generation from Trajectories, [Paper]

  • (arXiv 2021.04) UC^2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training, [Paper]

  • (arXiv 2021.04) Learning to Cluster Faces via Transformer, [Paper]

  • (arXiv 2021.04) Skeletor: Skeletal Transformers for Robust Body-Pose Estimation, [Paper]

  • (arXiv 2021.04) VidTr: Video Transformer Without Convolutions, [Paper]

  • (arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper]

  • (arXiv 2021.04) Going deeper with Image Transformers, [Paper]

  • (arXiv 2021.04) Efficient Pre-training Objectives for Transformers, [Paper], [Code]

  • (arXiv 2021.04) RoFormer: Enhanced Transformer with Rotary Position Embedding, [Paper]
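
RoFormer's rotary position embedding rotates each feature pair of the queries and keys by a position-dependent angle, so attention scores depend only on relative position. A sketch of the half-split pairing (one common implementation; the paper pairs adjacent dimensions):

```python
import numpy as np

def rope(x, base=10000.0):
    seq_len, d = x.shape
    half = d // 2
    # One rotation frequency per feature pair.
    freqs = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], -1)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
rq = rope(q)
# Rotations preserve vector norms.
assert np.allclose(np.linalg.norm(rq, axis=-1), np.linalg.norm(q, axis=-1))
```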

  • (arXiv 2021.04) VideoGPT: Video Generation using VQ-VAE and Transformers, [Paper], [Code]

  • (arXiv 2021.04) DODRIO: Exploring Transformer Models with Interactive Visualization, [Paper], [Code]

  • (arXiv 2021.04) Lifting Transformer for 3D Human Pose Estimation in Video, [Paper]

  • (arXiv 2021.04) Demystifying the Better Performance of Position Encoding Variants for Transformer, [Paper]

  • (arXiv 2021.04) Consistent Accelerated Inference via Confident Adaptive Transformers, [Paper], [Code]

  • (arXiv 2021.04) Temporal Query Networks for Fine-grained Video Understanding, [Paper], [Code]

  • (arXiv 2021.04) Face Transformer for Recognition, [Paper], [Code]

  • (arXiv 2021.04) VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks, [Paper]

  • (arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [Paper]

  • (arXiv 2021.04) Cross-Modal Retrieval Augmentation for Multi-Modal Classification, [Paper]

  • (arXiv 2021.04) Point-Based Modeling of Human Clothing, [Paper]

  • (arXiv 2021.04) Points as Queries: Weakly Semi-supervised Object Detection by Points, [Paper]

  • (arXiv 2021.04) Geometry-Free View Synthesis: Transformers and no 3D Priors, [Paper], [Code]

  • (arXiv 2021.04) Self-supervised Video Object Segmentation by Motion Grouping, [Paper], [Project]

  • (arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [Paper], [Code]

  • (arXiv 2021.04) Pose Recognition with Cascade Transformers, [Paper], [Code]

  • (arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper], [Project]

  • (arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]

  • (arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]

  • (arXiv 2021.04) Handwriting Transformers, [Paper]

  • (arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper]

  • (arXiv 2021.04) Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation, [Paper]

  • (arXiv 2021.04) Compressing Visual-linguistic Model via Knowledge Distillation, [Paper]

  • (arXiv 2021.04) When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes, [Paper]

  • (arXiv 2021.04) Variational Transformer Networks for Layout Generation, [Paper]

  • (arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]

  • (arXiv 2021.04) Fourier Image Transformer, [Paper]

  • (arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]

  • (arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]

  • (arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]

  • (arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]

  • (arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]

  • (arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]

  • (arXiv 2021.04) VisQA: X-raying Vision and Language Reasoning in Transformers, [Paper]

  • (arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]

  • (arXiv 2021.04) Language-based Video Editing via Multi-Modal Multi-Level Transformer, [Paper]

  • (arXiv 2021.04) LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference, [Paper]

  • (arXiv 2021.04) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]

  • (arXiv 2021.04) Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis, [Paper], [Project]

  • (arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]

  • (arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]

  • (arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]

  • (arXiv 2021.03) TransCenter: Transformers with Dense Queries for Multiple-Object Tracking, [Paper]

  • (arXiv 2021.03) PixelTransformer: Sample Conditioned Signal Generation, [Paper], [Code]

  • (arXiv 2021.03) Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation, [Paper]

  • (arXiv 2021.03) DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention, [Paper]

  • (arXiv 2021.03) Learning Spatio-Temporal Transformer for Visual Tracking, [Paper], [Code]

  • (arXiv 2021.03) StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, [Paper], [Code]

  • (arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]

  • (arXiv 2021.03) Robust Facial Expression Recognition with Convolutional Visual Transformers, [Paper]

  • (arXiv 2021.03) Describing and Localizing Multiple Changes with Transformers, [Paper], [Project]

  • (arXiv 2021.03) COTR: Correspondence Transformer for Matching Across Images, [Paper]

  • (arXiv 2021.03) Understanding Robustness of Transformers for Image Classification, [Paper]

  • (arXiv 2021.03) CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, [Paper]

  • (arXiv 2021.03) Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers, [Paper]

  • (arXiv 2021.03) HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval, [Paper]

  • (arXiv 2021.03) TFPose: Direct Human Pose Estimation with Transformers, [Paper], [Code]

  • (arXiv 2021.03) Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, [Paper]

  • (arXiv 2021.03) Transformer Tracking, [Paper], [Code]

  • (arXiv 2021.03) ViViT: A Video Vision Transformer, [Paper]

  • (arXiv 2021.03) CvT: Introducing Convolutions to Vision Transformers, [Paper], [Code]

  • (arXiv 2021.03) Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, [Paper], [Code]

  • (arXiv 2021.03) On the Adversarial Robustness of Visual Transformers, [Paper]

  • (arXiv 2021.03) Rethinking Spatial Dimensions of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [Paper]

  • (arXiv 2021.03) Read and Attend: Temporal Localisation in Sign Language Videos, [Paper], [Benchmark]

  • (arXiv 2021.03) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [Paper]

  • (arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]

  • (arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]

  • (arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code]

  • (arXiv 2021.03) Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning, [Paper], [Code]

  • (arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]

  • (arXiv 2021.03) Scene-Intuitive Agent for Remote Embodied Visual Grounding, [Paper]

  • (arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper]

  • (arXiv 2021.03) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]

  • (arXiv 2021.03) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]

  • (arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]

  • (arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]

  • (arXiv 2021.03) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]

  • (arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]

  • (arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]

  • (arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]

  • (arXiv 2021.03) Paying Attention to Multiscale Feature Maps in Multimodal Image Matching, [Paper]

  • (arXiv 2021.03) Hopper: Multi-hop Transformer for Spatiotemporal Reasoning, [Paper], [Code]

  • (arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]

  • (arXiv 2021.03) AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting, [Paper], [Code]

  • (arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]

  • (arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]

  • (arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]

  • (arXiv 2021.03) On the Sentence Embeddings from Pre-trained Language Models, [Paper]

  • (arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]

  • (arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]

  • (arXiv 2021.03) Decoupled Spatial Temporal Graphs for Generic Visual Grounding, [Paper]

  • (arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]

  • (arXiv 2021.03) Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models, [Paper], [Code]

  • (arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]

  • (arXiv 2021.03) Causal Attention for Vision-Language Tasks, [Paper], [Code]

  • (arXiv 2021.03) Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks, [Paper]

  • (arXiv 2021.03) WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training, [Paper]

  • (arXiv 2021.03) Attention is not all you need: pure attention loses rank doubly exponentially with depth, [Paper]

  • (arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]

  • (arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]

  • (arXiv 2021.03) Perceiver: General Perception with Iterative Attention, [Paper]

  • (arXiv 2021.03) Transformer in Transformer, [Paper], [Code]

  • (arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]

  • (arXiv 2021.03) OmniNet: Omnidirectional Representations from Transformers, [Paper]

  • (arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]

  • (arXiv 2021.02) Evolving Attention with Residual Convolutions, [Paper]

  • (arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]

  • (arXiv 2021.02) SparseBERT: Rethinking the Importance Analysis in Self-attention, [Paper]

  • (arXiv 2021.02) Investigating the Limitations of Transformers with Simple Arithmetic Tasks, [Paper], [Code]

  • (arXiv 2021.02) Do Transformer Modifications Transfer Across Implementations and Applications? [Paper]

  • (arXiv 2021.02) Do We Really Need Explicit Position Encodings for Vision Transformers? [Paper], [Code]

  • (arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]

  • (arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code]

  • (arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]

  • (arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]

  • (arXiv 2021.02) Centroid Transformer: Learning to Abstract with Attention, [Paper]

  • (arXiv 2021.02) Linear Transformers Are Secretly Fast Weight Memory Systems, [Paper]

  • (arXiv 2021.02) Position Information in Transformers: An Overview, [Paper]

  • (arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Project], [Code]

  • (arXiv 2021.02) Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, [Paper]

  • (arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]

  • (arXiv 2021.02) End-to-end Audio-visual Speech Recognition with Conformers, [Paper]

  • (arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]

  • (arXiv 2021.02) Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, [Paper], [Code]

  • (arXiv 2021.02) Video Transformer Network, [Paper]

  • (arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]

  • (arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]

  • (arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]

  • (arXiv 2021.02) Improving Visual Reasoning by Exploiting The Knowledge in Texts, [Paper]

  • (arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]

  • (arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]

  • (arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]

  • (arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]

  • (arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]

  • (arXiv 2021.01) Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, [Paper], [Code]

  • (arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]

  • (arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Page]

  • (arXiv 2021.01) Spherical Transformer: Adapting Spherical Signal to CNNs, [Paper]

  • (arXiv 2021.01) Are We There Yet? Learning to Localize in Embodied Instruction Following, [Paper]

  • (arXiv 2021.01) VinVL: Making Visual Representations Matter in Vision-Language Models, [Paper]

  • (arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper]

  • (arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]

  • (arXiv 2021.01) Addressing Some Limitations of Transformers with Feedback Memory, [Paper]

  • (arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]

  • (arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]

  • (arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]

  • (arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]

  • (arXiv 2021.01) Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, [Paper]

  • (arXiv 2020.12) Cloud Transformers, [Paper]

  • (arXiv 2020.12) Accurate Word Representations with Universal Visual Guidance, [Paper]

  • (arXiv 2020.12) DETR for Pedestrian Detection, [Paper]

  • (arXiv 2020.12) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]

  • (arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]

  • (arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]

  • (arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]

  • (arXiv 2020.12) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]

  • (arXiv 2020.12) Transformer for Image Quality Assessment, [Paper], [Code]

  • (arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]

  • (arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper]

  • (arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]

  • (arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]

  • (arXiv 2020.12) Point Transformer, [Paper]

  • (arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]

  • (arXiv 2020.12) Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting, [Paper]

  • (arXiv 2020.12) Pre-Trained Image Processing Transformer, [Paper]

  • (arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]

  • (arXiv 2020.11) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]

  • (arXiv 2020.11) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper]

  • (arXiv 2020.11) End-to-End Video Instance Segmentation with Transformers, [Paper]

  • (arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]

  • (arXiv 2020.11) General Multi-label Image Classification with Transformers, [[Paper]](https://arxiv.org/pdf/2011.14027)

  • (arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]

  • (arXiv 2020.10) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]

  • (arXiv 2020.07) Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks, [Paper], [Code]

  • (arXiv 2020.07) Feature Pyramid Transformer, [Paper], [Code]

  • (arXiv 2020.06) Linformer: Self-Attention with Linear Complexity, [Paper]

  • (arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]

  • (arXiv 2019.08) LXMERT: Learning Cross-Modality Encoder Representations from Transformers, [Paper], [Code]

  • (ICLR'21) IOT: Instance-wise Layer Reordering for Transformer Structures, [Paper], [Code]

  • (ICLR'21) UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers, [Paper], [Code]

  • (ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]

  • (ICLR'21) LambdaNetworks: Modeling Long-Range Interactions without Attention, [Paper], [Code]

  • (ICLR'21) Support-Set Bottlenecks for Video-Text Representation Learning, [Paper]

  • (ICLR'21) Colorization Transformer, [Paper], [Code]

  • (ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]

  • (ECCV'20) Connecting Vision and Language with Localized Narratives, [Paper]

  • (ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]

  • (CVPR'20) PaStaNet: Toward Human Activity Knowledge Engine, [Paper], [Code]

  • (CVPR'20) Multi-Modality Cross Attention Network for Image and Sentence Matching, [Paper], [Page]

  • (CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]

  • (CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]

  • (ICPR'20) Transformer Encoder Reasoning Network, [Paper], [Code]

  • (EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]

TODO