Transformer-in-Vision

Recent Transformer-based CV and related works. Welcome to comment/contribute!

Kept up to date.

Resource

Surveys:

  • (arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey, [Paper]

  • (arXiv 2021.01) Transformers in Vision: A Survey, [Paper]

Recent Papers

  • (arXiv 2021.04) Facial Attribute Transformers for Precise and Robust Makeup Transfer, [Paper]

  • (arXiv 2021.04) Emerging Properties in Self-Supervised Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) ConTNet: Why not use convolution and transformer at the same time? [Paper], [Code]

  • (arXiv 2021.04) Point Cloud Learning with Transformer, [Paper]

  • (arXiv 2021.04) Twins: Revisiting the Design of Spatial Attention in Vision Transformers, [Paper], [Code]

  • (arXiv 2021.04) Inpainting Transformer for Anomaly Detection, [Paper]

  • (arXiv 2021.04) Shot Contrastive Self-Supervised Learning for Scene Boundary Detection, [Paper]

  • (arXiv 2021.04) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [Paper]

  • (arXiv 2021.04) Visual Saliency Transformer, [Paper]

  • (arXiv 2021.04) Improve Vision Transformers Training by Suppressing Over-smoothing, [Paper], [Code]

  • (arXiv 2021.04) Visformer: The Vision-friendly Transformer, [Paper], [Code]

  • (arXiv 2021.04) TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking, [Paper]

  • (arXiv 2021.04) Mesh Graphormer, [Paper]

  • (arXiv 2021.04) TrajeVAE: Controllable Human Motion Generation from Trajectories, [Paper]

  • (arXiv 2021.04) UC^2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training, [Paper]

  • (arXiv 2021.04) Learning to Cluster Faces via Transformer, [Paper]

  • (arXiv 2021.04) Skeletor: Skeletal Transformers for Robust Body-Pose Estimation, [Paper]

  • (arXiv 2021.04) VidTr: Video Transformer Without Convolutions, [Paper]

  • (arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper]

  • (arXiv 2021.04) Going deeper with Image Transformers, [Paper]

  • (arXiv 2021.04) Efficient Pre-training Objectives for Transformers, [Paper], [Code]

  • (arXiv 2021.04) RoFormer: Enhanced Transformer with Rotary Position Embedding, [Paper]

  • (arXiv 2021.04) VideoGPT: Video Generation using VQ-VAE and Transformers, [Paper], [Code]

  • (arXiv 2021.04) DODRIO: Exploring Transformer Models with Interactive Visualization, [Paper], [Code]

  • (arXiv 2021.04) Lifting Transformer for 3D Human Pose Estimation in Video, [Paper]

  • (arXiv 2021.04) Demystifying the Better Performance of Position Encoding Variants for Transformer, [Paper]

  • (arXiv 2021.04) Consistent Accelerated Inference via Confident Adaptive Transformers, [Paper], [Code]

  • (arXiv 2021.04) Temporal Query Networks for Fine-grained Video Understanding, [Paper], [Code]

  • (arXiv 2021.04) Face Transformer for Recognition, [Paper], [Code]

  • (arXiv 2021.04) VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks, [Paper]

  • (arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [Paper]

  • (arXiv 2021.04) Cross-Modal Retrieval Augmentation for Multi-Modal Classification, [Paper]

  • (arXiv 2021.04) Point-Based Modeling of Human Clothing, [Paper]

  • (arXiv 2021.04) Points as Queries: Weakly Semi-supervised Object Detection by Points, [Paper]

  • (arXiv 2021.04) Geometry-Free View Synthesis: Transformers and no 3D Priors, [Paper], [Code]

  • (arXiv 2021.04) Self-supervised Video Object Segmentation by Motion Grouping, [Paper], [Project]

  • (arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [Paper], [Code]

  • (arXiv 2021.04) Pose Recognition with Cascade Transformers, [Paper], [Code]

  • (arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper], [Project]

  • (arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]

  • (arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]

  • (arXiv 2021.04) Handwriting Transformers, [Paper]

  • (arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper]

  • (arXiv 2021.04) Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation, [Paper]

  • (arXiv 2021.04) Compressing Visual-linguistic Model via Knowledge Distillation, [Paper]

  • (arXiv 2021.04) When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes, [Paper]

  • (arXiv 2021.04) Variational Transformer Networks for Layout Generation, [Paper]

  • (arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]

  • (arXiv 2021.04) Fourier Image Transformer, [Paper]

  • (arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]

  • (arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]

  • (arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]

  • (arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]

  • (arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]

  • (arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]

  • (arXiv 2021.04) VisQA: X-raying Vision and Language Reasoning in Transformers, [Paper]

  • (arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]

  • (arXiv 2021.04) Language-based Video Editing via Multi-Modal Multi-Level Transformer, [Paper]

  • (arXiv 2021.04) LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference, [Paper]

  • (arXiv 2021.04) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]

  • (arXiv 2021.04) Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis, [Paper], [Project]

  • (arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]

  • (arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]

  • (arXiv 2021.03) DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention, [Paper]

  • (arXiv 2021.03) Learning Spatio-Temporal Transformer for Visual Tracking, [Paper], [Code]

  • (arXiv 2021.03) StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, [Paper], [Code]

  • (arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]

  • (arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]

  • (arXiv 2021.03) Robust Facial Expression Recognition with Convolutional Visual Transformers, [Paper]

  • (arXiv 2021.03) Describing and Localizing Multiple Changes with Transformers, [Paper], [Project]

  • (arXiv 2021.03) COTR: Correspondence Transformer for Matching Across Images, [Paper]

  • (arXiv 2021.03) Understanding Robustness of Transformers for Image Classification, [Paper]

  • (arXiv 2021.03) CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, [Paper]

  • (arXiv 2021.03) Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers, [Paper]

  • (arXiv 2021.03) HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval, [Paper]

  • (arXiv 2021.03) TFPose: Direct Human Pose Estimation with Transformers, [Paper], [Code]

  • (arXiv 2021.03) Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, [Paper]

  • (arXiv 2021.03) Transformer Tracking, [Paper], [Code]

  • (arXiv 2021.03) ViViT: A Video Vision Transformer, [Paper]

  • (arXiv 2021.03) CvT: Introducing Convolutions to Vision Transformers, [Paper], [Code]

  • (arXiv 2021.03) Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, [Paper], [Code]

  • (arXiv 2021.03) On the Adversarial Robustness of Visual Transformers, [Paper]

  • (arXiv 2021.03) Rethinking Spatial Dimensions of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [Paper]

  • (arXiv 2021.03) Read and Attend: Temporal Localisation in Sign Language Videos, [Paper], [Benchmark]

  • (arXiv 2021.03) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [Paper]

  • (arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]

  • (arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]

  • (arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code]

  • (arXiv 2021.03) Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning, [Paper], [Code]

  • (arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]

  • (arXiv 2021.03) Scene-Intuitive Agent for Remote Embodied Visual Grounding, [Paper]

  • (arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper]

  • (arXiv 2021.03) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]

  • (arXiv 2021.03) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]

  • (arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]

  • (arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]

  • (arXiv 2021.03) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]

  • (arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]

  • (arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]

  • (arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]

  • (arXiv 2021.03) Paying Attention to Multiscale Feature Maps in Multimodal Image Matching, [Paper]

  • (arXiv 2021.03) Learning Multi-Scene Absolute Pose Regression with Transformers, [Paper]

  • (arXiv 2021.03) Hopper: Multi-hop Transformer for Spatiotemporal Reasoning, [Paper], [Code]

  • (arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]

  • (arXiv 2021.03) AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting, [Paper], [Code]

  • (arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]

  • (arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]

  • (arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]

  • (arXiv 2021.03) On the Sentence Embeddings from Pre-trained Language Models, [Paper]

  • (arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]

  • (arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]

  • (arXiv 2021.03) Decoupled Spatial Temporal Graphs for Generic Visual Grounding, [Paper]

  • (arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]

  • (arXiv 2021.03) Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models, [Paper], [Code]

  • (arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]

  • (arXiv 2021.03) Causal Attention for Vision-Language Tasks, [Paper], [Code]

  • (arXiv 2021.03) Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks, [Paper]

  • (arXiv 2021.03) WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training, [Paper]

  • (arXiv 2021.03) Attention is not all you need: pure attention loses rank doubly exponentially with depth, [Paper]

  • (arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]

  • (arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]

  • (arXiv 2021.03) Perceiver: General Perception with Iterative Attention, [Paper]

  • (arXiv 2021.03) Transformer in Transformer, [Paper], [Code]

  • (arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]

  • (arXiv 2021.03) OmniNet: Omnidirectional Representations from Transformers, [Paper]

  • (arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]

  • (arXiv 2021.02) Evolving Attention with Residual Convolutions, [Paper]

  • (arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]

  • (arXiv 2021.02) SparseBERT: Rethinking the Importance Analysis in Self-attention, [Paper]

  • (arXiv 2021.02) Investigating the Limitations of Transformers with Simple Arithmetic Tasks, [Paper], [Code]

  • (arXiv 2021.02) Do Transformer Modifications Transfer Across Implementations and Applications? [Paper]

  • (arXiv 2021.02) Do We Really Need Explicit Position Encodings for Vision Transformers? [Paper], [Code]

  • (arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]

  • (arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code]

  • (arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]

  • (arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]

  • (arXiv 2021.02) Centroid Transformer: Learning to Abstract with Attention, [Paper]

  • (arXiv 2021.02) Linear Transformers Are Secretly Fast Weight Memory Systems, [Paper]

  • (arXiv 2021.02) Position Information in Transformers: An Overview, [Paper]

  • (arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Project], [Code]

  • (arXiv 2021.02) Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, [Paper]

  • (arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]

  • (arXiv 2021.02) End-to-End Audio-Visual Speech Recognition with Conformers, [Paper]

  • (arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]

  • (arXiv 2021.02) Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, [Paper], [Code]

  • (arXiv 2021.02) Video Transformer Network, [Paper]

  • (arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]

  • (arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]

  • (arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]

  • (arXiv 2021.02) Improving Visual Reasoning by Exploiting The Knowledge in Texts, [Paper]

  • (arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]

  • (arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]

  • (arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]

  • (arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]

  • (arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]

  • (arXiv 2021.01) Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, [Paper], [Code]

  • (arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]

  • (arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Page]

  • (arXiv 2021.01) Spherical Transformer: Adapting Spherical Signal to CNNs, [Paper]

  • (arXiv 2021.01) Are We There Yet? Learning to Localize in Embodied Instruction Following, [Paper]

  • (arXiv 2021.01) VinVL: Making Visual Representations Matter in Vision-Language Models, [Paper]

  • (arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper]

  • (arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]

  • (arXiv 2021.01) Addressing Some Limitations of Transformers with Feedback Memory, [Paper]

  • (arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]

  • (arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]

  • (arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]

  • (arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]

  • (arXiv 2021.01) Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, [Paper]

  • (arXiv 2020.12) Cloud Transformers, [Paper]

  • (arXiv 2020.12) Accurate Word Representations with Universal Visual Guidance, [Paper]

  • (arXiv 2020.12) DETR for Pedestrian Detection, [Paper]

  • (arXiv 2020.12) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]

  • (arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]

  • (arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]

  • (arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]

  • (arXiv 2020.12) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]

  • (arXiv 2020.12) Transformer for Image Quality Assessment, [Paper], [Code]

  • (arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]

  • (arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper]

  • (arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]

  • (arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]

  • (arXiv 2020.12) Point Transformer, [Paper]

  • (arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]

  • (arXiv 2020.12) Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting, [Paper]

  • (arXiv 2020.12) Pre-Trained Image Processing Transformer, [Paper]

  • (arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]

  • (arXiv 2020.11) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]

  • (arXiv 2020.11) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper]

  • (arXiv 2020.11) End-to-End Video Instance Segmentation with Transformers, [Paper]

  • (arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]

  • (arXiv 2020.11) General Multi-label Image Classification with Transformers, [[Paper]](https://arxiv.org/pdf/2011.14027)

  • (arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]

  • (arXiv 2020.10) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]

  • (arXiv 2020.07) Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks, [Paper], [Code]

  • (arXiv 2020.07) Feature Pyramid Transformer, [Paper], [Code]

  • (arXiv 2020.06) Linformer: Self-Attention with Linear Complexity, [Paper]

  • (arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]

  • (arXiv 2019.08) LXMERT: Learning Cross-Modality Encoder Representations from Transformers, [Paper], [Code]

  • (ICLR'21) IOT: Instance-wise Layer Reordering for Transformer Structures, [Paper], [Code]

  • (ICLR'21) UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers, [Paper], [Code]

  • (ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]

  • (ICLR'21) LambdaNetworks: Modeling Long-Range Interactions without Attention, [Paper], [Code]

  • (ICLR'21) Support-set Bottlenecks for Video-Text Representation Learning, [Paper]

  • (ICLR'21) Colorization Transformer, [Paper], [Code]

  • (ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]

  • (ECCV'20) Connecting Vision and Language with Localized Narratives, [Paper]

  • (ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]

  • (CVPR'20) PaStaNet: Toward Human Activity Knowledge Engine, [Paper], [Code]

  • (CVPR'20) Multi-Modality Cross Attention Network for Image and Sentence Matching, [Paper], [Page]

  • (CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]

  • (CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]

  • (ICPR'20) Transformer Encoder Reasoning Network, [Paper], [Code]

  • (EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]

TODO