Numerical Scaling Trick for fp16
SirRob1997 opened this issue · 1 comment
SirRob1997 commented
The paper in Appendix G.3 states the following:
In ResiDual, sometimes the $x^d_k$ will exceed the value range that can be expressed by FP16 and may cause training error. When this happens, a simple numeric trick is to downscale $x^d_k$ to make it within the FP16 scope.
I'm wondering where and how exactly you implemented this constraint within fairseq.
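To make the quoted idea concrete, here is a minimal pure-Python sketch of why such a downscale can be harmless. It is not the repository's implementation. It assumes the dual residual stream is only consumed through a layer normalization, which is my reading of the paper. The helpers `fits_fp16`, `downscale`, and `layer_norm`, the sample values, and the factor 0.25 are all hypothetical and chosen only for illustration.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_fp16(values):
    """True if every entry lies inside the finite FP16 range."""
    return all(abs(v) <= FP16_MAX for v in values)


def downscale(values, factor):
    """Multiply every entry by a constant factor (hypothetical helper)."""
    return [v * factor for v in values]


def layer_norm(values, eps=1e-5):
    """Plain layer normalization without learned gain or bias."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Made-up activations; the first entry exceeds FP16_MAX.
x_d = [1.0e5, -2.0e4, 3.0e4, 5.0e3]
# 0.25 is an arbitrary illustrative factor, not a value from the repo.
scaled = downscale(x_d, 0.25)

print(fits_fp16(x_d), fits_fp16(scaled))

# Layer normalization is (up to eps) invariant to a constant rescaling,
# so the normalized outputs should agree closely.
max_diff = max(abs(a - b) for a, b in zip(layer_norm(x_d), layer_norm(scaled)))
print(max_diff < 1e-6)
```

The point of the sketch is only that a constant rescaling brings the values back into the FP16 range while leaving the normalized output essentially unchanged. Where and how the factor is applied in fairseq is exactly what the question above asks.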
SirRob1997 commented
Seems to be done with:
ResiDual/hydra_config/opus.yaml
Lines 25 to 26 in 8682f75