layer6ai-labs/T-Fixup
Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
Python · MIT license
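For orientation, below is a minimal sketch of the T-Fixup-style initialization the paper describes, written against a plain `torch.nn.Transformer` rather than this repo's fairseq-based code. The function name `t_fixup_init` and the exact parameters touched (attention value/output projections and FFN weights, scaled by 0.67 * N^(-1/4) in the encoder and (9 * M)^(-1/4) in the decoder after Xavier initialization) are our reading of the paper, not taken from this repository, so treat it as an illustrative assumption rather than the reference implementation.

```python
import torch.nn as nn


def t_fixup_init(model: nn.Transformer, d_model: int,
                 num_encoder_layers: int, num_decoder_layers: int) -> None:
    """Sketch of T-Fixup-style scaling on a vanilla nn.Transformer.

    Assumed scheme (see the paper for the authoritative rules):
    Xavier-initialize all weight matrices, then rescale the attention
    value/output projections and the FFN weights by
    0.67 * N^(-1/4) (encoder) and (9 * M)^(-1/4) (decoder).
    Removing layer norm and the warmup schedule is not handled here.
    """
    enc_scale = 0.67 * num_encoder_layers ** (-0.25)
    dec_scale = (9 * num_decoder_layers) ** (-0.25)

    # Xavier init for every matrix-shaped parameter.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

    for layer in model.encoder.layers:
        # Rows [2*d_model:] of in_proj_weight are the value projection.
        layer.self_attn.in_proj_weight.data[2 * d_model:] *= enc_scale
        layer.self_attn.out_proj.weight.data *= enc_scale
        layer.linear1.weight.data *= enc_scale
        layer.linear2.weight.data *= enc_scale

    for layer in model.decoder.layers:
        for attn in (layer.self_attn, layer.multihead_attn):
            attn.in_proj_weight.data[2 * d_model:] *= dec_scale
            attn.out_proj.weight.data *= dec_scale
        layer.linear1.weight.data *= dec_scale
        layer.linear2.weight.data *= dec_scale
```

Usage would be a single call after building the model, e.g. `t_fixup_init(model, d_model=512, num_encoder_layers=6, num_decoder_layers=6)`; whether the FFN (MLP) weights are scaled exactly this way is the subject of issue #5 below.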
Issues
- Details for initializing FFN (MLP blocks)? (#5, opened by zhuchen03)
- Does adding layer norm together with T-Fixup make the model even better, or does T-Fixup make layer norm completely unnecessary (i.e. no performance gain)? (#7, opened by yxchng)
- T-Fixup for Language Modeling (#6, opened by sairams-intel)
- FP16 Training (#1, opened by libeineu)
- Possible minor typo in the paper (#2, opened by kdexd)