layer6ai-labs/T-Fixup

FP16 Training

libeineu opened this issue · 2 comments

Hi! Very cool work, with a rigorous theoretical analysis and exciting results! My name is Bei Li, the author of Learning Deep Transformer Models for Machine Translation. Recently there have been many works focusing on improving deep Transformers through better initialization strategies, but a serious problem is that FP16 training becomes unstable when using those strategies. I wonder whether you have tried FP16 training in this work? It's very interesting! Looking forward to your next work!

Hello, I have run the script on the WMT14 En-De task, but I cannot reproduce the 29.1 BLEU score yet. Could you please provide the training details?

Hi Bei,

Thank you for your interest in our work. Both the deep models and the big model on WMT'17 En-De were trained with fp16, while the WMT'17 BASE and IWSLT'14 models were trained with fp32, mainly due to training time concerns. Our T-Fixup models train well with fp16 precision, but we did notice an occasional loss scale overflow problem (especially with very deep models).

These errors are not caused by T-Fixup; instead they seem to be associated with fairseq's fp16 mode (see, e.g., facebookresearch/fairseq#512). So far we don't know how to solve them fully, but we did notice that reducing the learning rate helps in such situations.
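In case it helps to see what is going on behind those warnings, below is a minimal sketch of dynamic loss scaling written with plain `torch.cuda.amp` rather than fairseq's fp16 code path or our T-Fixup training scripts; the model, data, and hyperparameters are placeholders, not our WMT settings. It shows how an fp16 overflow makes the scaler skip the update and shrink the loss scale, and why a smaller learning rate (keeping gradients in a representable range) can reduce how often that happens.

```python
import torch

# Placeholder model and optimizer; not the actual Transformer/T-Fixup setup.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # try a lower lr on repeated overflows
scaler = torch.cuda.amp.GradScaler(init_scale=2.0 ** 7)    # analogous in spirit to fairseq's initial loss scale

for step in range(100):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randn(64, 512, device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                # run forward pass in fp16 where safe
        loss = torch.nn.functional.mse_loss(model(x), y)

    scaler.scale(loss).backward()                  # scale the loss so small gradients don't underflow in fp16
    scaler.step(optimizer)                         # skips the parameter update if gradients overflowed (inf/nan)
    scaler.update()                                # lowers the scale after an overflow, grows it again otherwise

    if step % 10 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}  scale {scaler.get_scale():.0f}")
```

The repeated "loss scale overflow" messages correspond to the skip-and-shrink branch above: if the scale keeps collapsing toward its minimum, training stalls, which is where a smaller learning rate tends to help.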

For the WMT'14 En-De model reproduction, please refer to our email exchanges, or to the model parameter section in the supplementary file.