layer6ai-labs/T-Fixup

Does adding layer norm together with T-Fixup make the model even better, or does T-Fixup make layer norm completely unnecessary (i.e. no performance gain)?

yxchng opened this issue · 0 comments

I can't seem to find a comparison between T-Fixup and T-Fixup + layer norm in the paper. If you have any insights into this, I'd appreciate an answer. Thanks.