
Transformers without Tears: Improving the Normalization of Self-Attention


Paper

Link: https://arxiv.org/abs/1910.05895
Year: 2019

Summary

  • ScaleNorm: normalization with a single learned scale parameter, giving faster training and better performance (see the sketch after this bullet)
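
A minimal PyTorch sketch of ScaleNorm as described in the paper: every activation vector is rescaled to a single learned length `g`, i.e. ScaleNorm(x) = g · x / ||x||₂. The class name, `eps`, and the sqrt(d_model) initialization below are assumptions for illustration, not the authors' reference code.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """ScaleNorm: rescale each vector to a learned length g.

    Computes g * x / ||x||_2 with a single scalar parameter g,
    in contrast to LayerNorm's per-dimension gain and bias.
    """
    def __init__(self, scale: float, eps: float = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(scale))  # e.g. sqrt(d_model)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # l2 norm over the feature dimension, clamped for numerical stability
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm

# usage: normalize a batch of hidden states of width d_model = 512
d_model = 512
x = torch.randn(2, 10, d_model)
y = ScaleNorm(scale=d_model ** 0.5)(x)
```

FixNorm, the paper's other normalization change, applies the same l2-rescaling idea to the word embeddings (as I understand it), so each normalization site adds only one extra scalar.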

Results

  • ScaleNorm is faster than LayerNorm
  • warmup-free training: the usual learning-rate warmup stage can be dropped
  • the authors propose three changes to the Transformer: PreNorm + FixNorm + ScaleNorm (see the sketch after this list)
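
To make the PreNorm change concrete, here is a hypothetical PyTorch wrapper contrasting the two residual placements: PreNorm computes x + sublayer(norm(x)), while the original PostNorm Transformer computes norm(x + sublayer(x)). For brevity the wrapper assumes a single-input sublayer (e.g. the feed-forward block); `norm` can be the ScaleNorm sketch above or `nn.LayerNorm`.

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """PreNorm residual connection: x + sublayer(norm(x))."""
    def __init__(self, norm: nn.Module, sublayer: nn.Module):
        super().__init__()
        self.norm = norm
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # normalize before the sublayer, then add the residual
        return x + self.sublayer(self.norm(x))

class PostNormResidual(nn.Module):
    """PostNorm residual connection: norm(x + sublayer(x))."""
    def __init__(self, norm: nn.Module, sublayer: nn.Module):
        super().__init__()
        self.norm = norm
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # add the residual first, then normalize (original Transformer)
        return self.norm(x + self.sublayer(x))
```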

Comments

Presentation slides: https://tnq177.github.io/data/transformers_without_tears.pdf