
Transformers without Tears: Improving the Normalization of Self-Attention


Paper

Link: https://arxiv.org/abs/1910.05895
Year: 2019

Summary

  • ScaleNorm: normalization with a single learned scale parameter, giving faster training and better performance (see the sketch after this bullet)
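
A minimal PyTorch sketch of ScaleNorm as described in the paper: every activation vector is rescaled to a single learned length `g`, i.e. ScaleNorm(x) = g · x / ||x||₂. The class name, `eps`, and the sqrt(d_model) initialization below are assumptions for illustration, not the authors' reference code.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """ScaleNorm: rescale each vector to a learned length g.

    Computes g * x / ||x||_2 with a single scalar parameter g,
    in contrast to LayerNorm's per-dimension gain and bias.
    """
    def __init__(self, scale: float, eps: float = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(scale))  # e.g. sqrt(d_model)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # l2 norm over the feature dimension, clamped for numerical stability
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm

# usage: normalize a batch of hidden states of width d_model = 512
d_model = 512
x = torch.randn(2, 10, d_model)
y = ScaleNorm(scale=d_model ** 0.5)(x)
```

FixNorm, the paper's other normalization change, applies the same l2-rescaling idea to the word embeddings (as I understand it), so each normalization site adds only one extra scalar.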

Results

  • ScaleNorm is faster than LayerNorm
  • warmup-free training: the usual learning-rate warmup stage can be dropped
  • the authors propose three changes to the Transformer: PreNorm + FixNorm + ScaleNorm (see the sketch after this list)
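
To make the PreNorm change concrete, here is a hypothetical PyTorch wrapper contrasting the two residual placements: PreNorm computes x + sublayer(norm(x)), while the original PostNorm Transformer computes norm(x + sublayer(x)). For brevity the wrapper assumes a single-input sublayer (e.g. the feed-forward block); `norm` can be the ScaleNorm sketch above or `nn.LayerNorm`.

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """PreNorm residual connection: x + sublayer(norm(x))."""
    def __init__(self, norm: nn.Module, sublayer: nn.Module):
        super().__init__()
        self.norm = norm
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # normalize before the sublayer, then add the residual
        return x + self.sublayer(self.norm(x))

class PostNormResidual(nn.Module):
    """PostNorm residual connection: norm(x + sublayer(x))."""
    def __init__(self, norm: nn.Module, sublayer: nn.Module):
        super().__init__()
        self.norm = norm
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # add the residual first, then normalize (original Transformer)
        return self.norm(x + self.sublayer(x))
```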

Comments

Presentation slides: https://tnq177.github.io/data/transformers_without_tears.pdf