Numerical Scaling Trick for fp16

The paper in Appendix G.3 states the following:

In ResiDual, sometimes the x^d_k will exceed the value range that can be expressed by FP16 and may cause training error. When this happens, a simple numeric trick is to downscale x^d_k to make it within the FP16 scope.

I'm wondering where and how exactly you implemented this constraint within fairseq.

This seems to be done with the following config options:

ResiDual/hydra_config/opus.yaml

Lines 25 to 26 in 8682f75

    enc_res_input_norm_scale: 0.05
    dec_res_input_norm_scale: 0.05
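
For anyone else reading, here is a minimal NumPy sketch of why a constant downscale like 0.05 can help. It is not the repository's code: the function names, the toy update values, and the choice to scale each update before accumulation are all my own assumptions for illustration. It relies on the fact that FP16 tops out at 65504 and that layer normalization is (up to its epsilon) unchanged when its input is multiplied by a positive constant.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0, largest finite FP16 value


def accumulate_fp16(updates, scale=1.0):
    """Sum per-layer updates in an FP16 accumulator, scaling each one first.

    Hypothetical stand-in for the dual residual stream; not fairseq code.
    """
    acc = np.zeros_like(updates[0], dtype=np.float16)
    with np.errstate(over="ignore"):  # silence the overflow warning in the demo
        for u in updates:
            scaled_u = (np.float32(scale) * u.astype(np.float32)).astype(np.float16)
            acc = acc + scaled_u  # FP16 + FP16 stays FP16
    return acc


def layer_norm(x, eps=1e-5):
    """Plain layer norm without learned gain/bias, computed in FP32."""
    x = x.astype(np.float32)
    return (x - x.mean()) / np.sqrt(x.var() + eps)


# Toy data (assumed): 8 layers, each adding the same large activation vector.
updates = [np.array([20000.0, -20000.0, 5000.0, 100.0], dtype=np.float32)
           for _ in range(8)]

unscaled = accumulate_fp16(updates)             # running sum passes FP16_MAX
scaled = accumulate_fp16(updates, scale=0.05)   # stays well inside FP16 range
reference = np.sum(updates, axis=0)             # exact sum in FP32

print("unscaled finite:", bool(np.isfinite(unscaled).all()))
print("scaled finite:  ", bool(np.isfinite(scaled).all()))
print("max |LN(scaled) - LN(reference)|:",
      float(np.abs(layer_norm(scaled) - layer_norm(reference)).max()))
```

In this toy case the unscaled accumulator overflows to infinity, while the scaled one stays finite and gives essentially the same normalized output as the FP32 reference. Whether `enc_res_input_norm_scale` and `dec_res_input_norm_scale` are applied at exactly this point in the fairseq model is something I have not verified.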