Transformer ScaledDotProductAttention energy value on 16-bit Precision.
ankitvad opened this issue · 3 comments
So - I've been trying to run my code on the GPU with 16-bit precision and there is a problem with this line of the code:
if mask is not None:
energy = energy.masked_fill(mask == 0, -1e10)
The value -1e10 can't be converted to half (16-bit) precision! Consulting the paper, the authors mask using -infinity, which was presumably approximated here as the very large value -1e10.
A way around this is using -float('inf'):
if mask is not None:
energy = energy.masked_fill(mask == 0, -float('inf'))
Considering that we apply softmax to this energy to get the attention probabilities, -float('inf') works perfectly, and it works with 16-bit precision as well as multi-GPU training.
Just my solution in case someone else faces this same issue! :) I think this should work!
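A minimal sketch of the fix above (the tensor shapes and names here are illustrative, not from the repo):

```python
import torch

energy = torch.randn(1, 4, dtype=torch.float16)
mask = torch.tensor([[1, 1, 0, 0]])

# -float('inf') is representable in float16, unlike -1e10.
energy = energy.masked_fill(mask == 0, -float('inf'))

# Cast up for the softmax so the snippet also runs on CPU builds
# that lack a half-precision softmax kernel.
attention = torch.softmax(energy.float(), dim=-1)
print(attention)  # masked positions get exactly zero weight
```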
Upon experimentation - it seems that -float('inf') can actually cause a lot of NaN loss issues! So the safe way to go about this is to use -1e4 instead of -1e10; that way the value still fits in the representable range of 16-bit precision!
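To see why -1e4 is safe where -1e10 is not, here is a small illustration (the energy/mask tensors are made up for the example): float16 can only represent magnitudes up to about 65504, so -1e10 is far out of range while -1e4 fits exactly, and exp(-1e4) still underflows to zero in the softmax.

```python
import torch

# float16's representable range is roughly ±65504.
print(torch.finfo(torch.float16).min)           # -65504.0
print(torch.tensor(-1e4, dtype=torch.float16))  # -10000 is exactly representable

energy = torch.randn(1, 4, dtype=torch.float16)
mask = torch.tensor([[1, 1, 0, 0]])

# No overflow here, since -1e4 fits in float16.
energy = energy.masked_fill(mask == 0, -1e4)

# exp(-1e4) underflows to zero, so masked positions still get ~zero weight.
attention = torch.softmax(energy.float(), dim=-1)
```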
Older versions of PyTorch (1.5, for instance) can actually produce NaN when using -float('inf'), but the newer ones are fine in my preliminary experiments. However, based on updates in the Hugging Face Transformers library, it seems that hard-coding a static "minimum possible value" is not the smart thing to do! The change they proposed and are using (huggingface/transformers#17306) is:
torch.finfo(self.dtype).min
where self.dtype can be a dtype such as torch.float32 or torch.float16, or it can be read directly from a float tensor's .dtype attribute:
>>> tmp = torch.randn(3,5)
>>> tmp
tensor([[ 1.1179, 0.6827, 0.3662, 0.0312, -0.1084],
[ 0.0184, -0.5863, -2.4907, -0.6222, -0.5112],
[ 0.3818, 1.9543, -1.0868, -0.7464, 0.9879]])
>>> tmp.dtype
torch.float32
>>> torch.finfo(tmp.dtype).min
-3.4028234663852886e+38
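Putting that together, a dtype-aware version of the masking step could look like this (apply_mask is a hypothetical helper name, not from the repo or from Transformers):

```python
import torch

def apply_mask(energy: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Fill masked positions with the most negative value representable
    # in energy's own dtype, so the same code works for float32 and
    # float16 without hard-coding a constant.
    return energy.masked_fill(mask == 0, torch.finfo(energy.dtype).min)

energy16 = torch.randn(1, 4, dtype=torch.float16)
mask = torch.tensor([[1, 1, 0, 0]])
masked = apply_mask(energy16, mask)
print(masked[0, 2])  # tensor(-65504., dtype=torch.float16)
```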