- Recurrent Models of Visual Attention [2014 deepmind NIPS]
- Neural Machine Translation by Jointly Learning to Align and Translate [ICLR 2015]
- Efficient Transformers: A Survey [paper]
- A Survey on Visual Transformer [paper]
- Transformers in Vision: A Survey [paper]
- Sequence to Sequence Learning with Neural Networks [NIPS 2014] [paper] [code]
- End-To-End Memory Networks [NIPS 2015] [paper] [code]
- Attention is all you need [NIPS 2017] [paper] [code]
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [NAACL 2019] [paper] [code] [pretrained-models]
- Reformer: The Efficient Transformer [ICLR2020] [paper] [code]
- Linformer: Self-Attention with Linear Complexity [arxiv 2020] [paper] [code]
- GPT-3: Language Models are Few-Shot Learners [NeurIPS 2020] [paper] [code]
- Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation [INTERSPEECH 2020] [paper] [code]
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [arxiv 2021] [paper] [code]
- ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [ICLR 2021] [paper] [code]
- Trained with extra private data; does not generalize well when trained on insufficient amounts of data
- DeiT: Data-efficient Image Transformers [arxiv2021] [paper] [code]
- Token-based strategy built upon ViT and convolutional models
- T2T-ViT: Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [arxiv2021] [paper] [code]
- Transformer in Transformer [arxiv 2021] [paper] [code1] [code-official]
- OmniNet: Omnidirectional Representations from Transformers [arxiv2021] [paper]
- Convolutional Cifar10
- vision-transformers-cifar10
- Found that performance was worse than a simple ResNet-18
- Explores the influence of hyper-parameters (e.g., ViT embedding dimension)
- ViT-pytorch
- Using pretrained weights yields better results
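As a side note to the ViT experiments above: the "image is worth 16x16 words" tokenization amounts to cutting the image into non-overlapping patches and flattening each one into a token. A minimal NumPy sketch (the `patchify` helper is illustrative, not taken from any of the listed codebases):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    i.e. ViT-style tokens (before the linear projection)."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Group rows and columns into a (H/p, W/p) grid of (p, p, C) patches.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one token vector of length p * p * C.
    return patches.reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image becomes 196 tokens of dimension 768.
tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

Each token is then linearly projected and prepended with a learnable class token before entering the Transformer encoder.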
- DETR: End-to-End Object Detection with Transformers [ECCV2020] [paper] [code]
- Deformable DETR: Deformable Transformers for End-to-End Object Detection [ICLR2021] [paper] [code]
- End-to-End Object Detection with Adaptive Clustering Transformer [arxiv2020] [paper]
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [arxiv2020] [paper]
- Rethinking Transformer-based Set Prediction for Object Detection [arxiv2020] [paper] [zhihu]
- End-to-end Lane Shape Prediction with Transformers [WACV 2021] [paper] [code]
- ViT-FRCNN: Toward Transformer-Based Object Detection [arxiv2020] [paper]
- Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [arxiv 2021] [paper] [code]
- SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [arxiv2021] [paper] [code]
- Trans2Seg: Transparent Object Segmentation with Transformer [arxiv2021] [paper] [code]
- End-to-End Video Instance Segmentation with Transformers [arxiv2020] [paper] [zhihu]
- MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers [arxiv2020] [paper]
- Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [arxiv 2020] [paper] [code]
- TransGAN: Two Transformers Can Make One Strong GAN [paper] [code]
- Taming Transformers for High-Resolution Image Synthesis [paper] [code]
- iGPT: Generative Pretraining from Pixels [ICML 2020] [paper] [code]
- Generative Adversarial Transformers [arxiv 2021] [paper] [code]
- STTN: Learning Joint Spatial-Temporal Transformations for Video Inpainting [ECCV 2020] [paper] [code]
- Pre-Trained Image Processing Transformer [arxiv2020] [paper]
- TTSR: Learning Texture Transformer Network for Image Super-Resolution [CVPR2020] [paper] [code]
- Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation [ECCV 2020] [paper]
- HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation [ACMMM 2020] [paper]
- End-to-End Human Pose and Mesh Reconstruction with Transformers [arxiv 2020] [paper]
- 3D Human Pose Estimation with Spatial and Temporal Transformers [arxiv 2020] [paper] [code]
- Multimodal Motion Prediction with Stacked Transformers [CVPR 2021] [paper] [code]
- Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case [paper]
- Transformer networks for trajectory forecasting [ICPR 2020] [paper] [code]
- Spatial-Channel Transformer Network for Trajectory Prediction on the Traffic Scenes [arxiv 2021] [paper] [code]
- Pedestrian Trajectory Prediction using Context-Augmented Transformer Networks [ICRA 2020] [paper] [code]
- Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction [ECCV 2020] [paper] [code]
- Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction [paper]
- Single-Shot Motion Completion with Transformer [arxiv2021] [paper] [code]
- Detailed explanations of the attention mechanism, parts 1 and 2 [zhihu1] [zhihu2]
- Self-attention mechanisms in natural language processing
- Detailed explanation of the Transformer model [zhihu] [csdn]
- A complete walkthrough of RNNs, Seq2Seq, and the attention mechanism
- Seq2Seq and transformer implementation
- End-To-End Memory Networks [zhihu]
- Illustrating the key, query, and value in attention
- Transformer in CV
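The key/query/value attention illustrated in the posts above boils down to a few lines of NumPy. A minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V (function and variable names are my own):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention from 'Attention Is All You Need'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # outputs are weighted sums of values

# Self-attention: three tokens with 4-dim embeddings serve as Q, K, and V.
X = np.random.default_rng(0).normal(size=(3, 4))
out, w = attention(X, X, X)
# out has shape (3, 4); each row of w sums to 1
```

In a real Transformer, Q, K, and V are separate learned linear projections of the input, and multiple such attention heads run in parallel.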
Thanks to the authors of these awesome Transformer survey papers.