As the Transformer has become the dominant architecture in NLP, increasing its capacity while keeping it trainable has become an active research topic. This repo collects papers on training deep Transformers.
- Training Deeper Neural Machine Translation Models with Transparent Attention [EMNLP18]
- Learning Deep Transformer Models for Machine Translation [ACL19]
- Fixup Initialization: Residual Learning Without Normalization [ICLR19]
- Depth Growing for Neural Machine Translation [ACL19]
- Improving Transformer Optimization Through Better Initialization [ICML20]
- Multiscale Collaborative Deep Models for Neural Machine Translation [ACL20]
- Lipschitz Constrained Parameter Initialization for Deep Transformers [ACL20]
- Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention [EMNLP19]
- ReZero is All You Need: Fast Convergence at Large Depth [arxiv] (see the sketch below)
- Understanding the Difficulty of Training Transformers [arxiv]
- Very Deep Transformers for Neural Machine Translation [arxiv]
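
For a quick sense of one of the simplest recipes on this list, below is a minimal PyTorch-style sketch of the ReZero residual connection from "ReZero is All You Need": each residual branch is scaled by a learnable scalar initialized to zero, so every layer starts as the identity map and very deep stacks remain trainable without LayerNorm. The class and parameter names here are illustrative, not the paper's official implementation.

```python
import torch
import torch.nn as nn

class ReZeroTransformerLayer(nn.Module):
    """Illustrative Transformer encoder layer with a ReZero residual connection.

    Each sublayer output is scaled by a learnable scalar `alpha` initialized to
    zero, so the layer starts as the identity. LayerNorm is omitted, following
    the ReZero formulation.
    """

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)
        # One learnable residual weight per layer, initialized to 0.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (seq_len, batch, d_model)
        attn_out, _ = self.self_attn(x, x, x)
        x = x + self.alpha * self.dropout(attn_out)     # x + alpha * SelfAttn(x)
        x = x + self.alpha * self.dropout(self.ff(x))   # x + alpha * FFN(x)
        return x
```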