Deep-Transformer-PaperList

As the Transformer has become a dominant architecture in NLP, increasing its capacity while keeping it trainable has drawn growing research interest. This repo collects papers on training deep Transformers.

Paper List

  • Training Deeper Neural Machine Translation Models with Transparent Attention [EMNLP18]
  • Learning Deep Transformer Models for Machine Translation [ACL19]
  • Fixup Initialization: Residual Learning Without Normalization [ICLR19]
  • Depth Growing for Neural Machine Translation [ACL19]
  • Improving Transformer Optimization Through Better Initialization [ICML20]
  • Multiscale Collaborative Deep Models for Neural Machine Translation [ACL20]
  • Lipschitz Constrained Parameter Initialization for Deep Transformers [ACL20]
  • Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention [EMNLP19]
  • ReZero is All You Need: Fast Convergence at Large Depth [arXiv] (illustrated in the sketch below)
  • Understanding the Difficulty of Training Transformers [arXiv]
  • Very Deep Transformers for Neural Machine Translation [arXiv]
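Many of the listed papers tackle the same failure mode: residual branches whose output scale grows with depth, which makes deep Transformer stacks hard to optimize. As a rough illustration of one listed approach (ReZero), below is a minimal PyTorch sketch of a residual sub-layer gated by a learnable scalar initialized to zero; it is a simplified sketch of the core idea only, and the module name and hyperparameters (ReZeroBlock, d_model, d_ff) are illustrative assumptions, not taken from any paper's released code.

```python
import torch
import torch.nn as nn


class ReZeroBlock(nn.Module):
    """Residual feed-forward sub-layer gated by a learnable scalar.

    The ReZero idea: compute x + alpha * F(x) with alpha initialized to 0,
    so every block starts as the identity mapping and the residual branch
    is learned gradually, which helps very deep stacks converge.
    """

    def __init__(self, d_model: int, d_ff: int = 2048):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        # Learnable gate, initialized to zero.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.ff(x)


# Example usage: shape (batch, seq_len, d_model) is preserved.
block = ReZeroBlock(d_model=512)
out = block(torch.randn(8, 10, 512))
```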