bentrevett/pytorch-seq2seq

Transformer ScaledDotProductAttention energy value on 16-bit Precision.

ankitvad opened this issue · 3 comments

So - I've been trying to run my code on the GPU with 16-bit precision, and there is a problem with this line of code:

    if mask is not None:
        energy = energy.masked_fill(mask == 0, -1e10)

The -1e10 can't be converted to half precision! Consulting the paper, the authors mask the values with -infinity, which is presumably represented here as a very large negative value, -1e10.
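For reference, here's a quick check of the float16 range (just a REPL sketch; the exact error you get from masked_fill may vary across PyTorch versions):

>>> import torch
>>> torch.finfo(torch.float16).min  # smallest representable float16 value
-65504.0
>>> torch.tensor(-1e10).half()      # -1e10 is far outside that range and overflows
tensor(-inf, dtype=torch.float16)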

A way around this is to use -float('inf') instead:

    if mask is not None:
        energy = energy.masked_fill(mask == 0, -float('inf'))

Considering that this energy goes through a softmax to give the attention probabilities, -float('inf') works perfectly, both with 16-bit precision and with multi-GPU training.
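For example, with a partially masked row the -inf positions get exactly zero attention weight (quick REPL sketch with made-up scores):

>>> import torch
>>> torch.softmax(torch.tensor([2.0, float('-inf'), 1.0]), dim=-1)
tensor([0.7311, 0.0000, 0.2689])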

Just my solution, in case someone else faces this same issue! :) I think this should work!

Upon experimentation - it seems that -float('inf') can actually cause a lot of NaN loss issues! So the safe way to go about this is to use -1e4 instead of -1e10; that way it still fits in the representable range of 16-bit precision.
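The NaN case is easy to reproduce: if a query attends to nothing (a fully masked row), every entry becomes -inf and the softmax turns into 0/0. A -1e4 fill, on the other hand, still fits into float16 (quick REPL sketch):

>>> import torch
>>> torch.softmax(torch.full((1, 4), float('-inf')), dim=-1)  # fully masked row -> NaN
tensor([[nan, nan, nan, nan]])
>>> torch.tensor(-1e4).half()  # -1e4 is representable in float16
tensor(-10000., dtype=torch.float16)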

Older versions of PyTorch (1.5, for instance) can actually produce NaNs with -float('inf'), but the newer ones are fine in my preliminary experiments. However, based on updates in the Huggingface Transformers library, it seems that hard-coding a static minimum value is not the smart thing to do. Following the change they proposed and are using (huggingface/transformers#17306):
torch.finfo(self.dtype).min, where self.dtype can be float, float16, etc., or can be read directly from the dtype of a float tensor.

>>> tmp = torch.randn(3,5)
>>> tmp
tensor([[ 1.1179,  0.6827,  0.3662,  0.0312, -0.1084],
        [ 0.0184, -0.5863, -2.4907, -0.6222, -0.5112],
        [ 0.3818,  1.9543, -1.0868, -0.7464,  0.9879]])
>>> tmp.dtype
torch.float32
>>> torch.finfo(tmp.dtype).min
-3.4028234663852886e+38
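Putting it together, the masking line in the attention code could look roughly like this (just a sketch, using the same energy and mask tensors as above, with the fill value taken from the dtype of energy):

    if mask is not None:
        energy = energy.masked_fill(mask == 0, torch.finfo(energy.dtype).min)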