microsoft/ResiDual

Numerical Scaling Trick for fp16

SirRob1997 opened this issue · 1 comment

The paper in Appendix G.3 states the following:

In ResiDual, sometimes the $x^d_k$ will exceed the value range that can be expressed by FP16 and may cause training error. When this happens, a simple numeric trick is to downscale $x^d_k$ to make it fall within the FP16 scope.

I'm wondering where and how exactly you implemented this constraint within fairseq.

This seems to be done via these config options:

enc_res_input_norm_scale: 0.05
dec_res_input_norm_scale: 0.05
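
For anyone else landing here, below is a minimal NumPy sketch of why a scale like 0.05 would help. It is not taken from the fairseq code in this repo: the function names `layer_norm` and `accumulate_dual` are hypothetical, and it assumes the scale is applied to each branch output before it is added to the dual residual $x^d_k$, which is then layer-normalized. Since LayerNorm is (up to its epsilon) invariant to a constant rescaling of its input, downscaling the accumulated stream should leave the normalized result essentially unchanged while keeping the FP16 sum below the maximum finite value of 65504.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm without learned affine parameters, computed in float32.
    x = x.astype(np.float32)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def accumulate_dual(branch_outputs, scale=1.0):
    # Hypothetical stand-in for the dual residual stream: sum the
    # (optionally downscaled) branch outputs in float16.
    x_d = np.zeros_like(branch_outputs[0], dtype=np.float16)
    with np.errstate(over="ignore"):  # silence the overflow warning
        for f in branch_outputs:
            x_d = x_d + f * np.float16(scale)
    return x_d

# Four layers whose outputs are large but still representable in float16.
outs = [np.array([[20000.0, -20000.0, 10000.0, 5000.0]], dtype=np.float16)] * 4

unscaled = accumulate_dual(outs)           # overflows the float16 range
scaled = accumulate_dual(outs, scale=0.05) # stays finite

# Reference: the same sum carried out in float32, without any scaling.
reference = layer_norm(sum(o.astype(np.float32) for o in outs))

print(np.isfinite(unscaled).all(), np.isfinite(scaled).all())
print(np.abs(layer_norm(scaled) - reference).max())
```

If the actual implementation applies the scale at a different point (for example only on the input to the final norm), the same scale-invariance argument should still apply, but that would need to be checked against the code.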