Numerical instability in Google Colab - Part 4 of Makemore
sachag678 opened this issue · 8 comments
I ran into an interesting issue in makemore part 4 (backprop ninja) where dhpreact was not exactly matching hpreact.grad.
However, this only happened in the Colab notebook; when I ran the same code in a local Jupyter notebook it matched exactly.
Not sure why this would be the case, but it's an odd curiosity.
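For reference, here is a minimal sketch of the kind of comparison involved. The `cmp` helper and the variable names follow the lecture's style, but the tiny tanh graph below is an illustrative stand-in, not the actual makemore network:

```python
import torch

def cmp(name, dt, t):
    """Compare a manually computed gradient dt against autograd's t.grad."""
    ex = torch.all(dt == t.grad).item()         # exact element-wise match
    app = torch.allclose(dt, t.grad)            # match within float tolerance
    maxdiff = (dt - t.grad).abs().max().item()  # largest element-wise gap
    print(f"{name:10s} | exact: {ex} | approx: {app} | maxdiff: {maxdiff:.2e}")

torch.manual_seed(42)
hpreact = torch.randn(4, 8, requires_grad=True)  # stand-in pre-activations
h = torch.tanh(hpreact)
loss = h.sum()
loss.backward()

dh = torch.ones_like(h)        # dloss/dh for a plain sum
dhpreact = (1.0 - h**2) * dh   # tanh backward, done by hand
cmp("hpreact", dhpreact, hpreact)
```

Whether `exact` comes out `True` or `False` is precisely what can vary between environments; `approx` should be `True` either way.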
oh oh
I'm guessing it has something to do with the Python versions?
Yes, I have the issue with Colab, but not with the local VS Code Jupyter notebook.
The local Jupyter notebook Python version is
Python 3.7.13
The tested colab notebook version is
3.7.14 (default, Sep 8 2022, 00:06:44)
[GCC 7.5.0]
If the difference is small enough, maybe it's fine to accept it with an approximate comparison?
Colab tested notebook: https://colab.research.google.com/drive/1HmZ8bgtAfvyMaZyu3Sr1Bgxsj35jitTs?usp=sharing
Maybe the issue is the PyTorch version?
I used `t.grad.sum()` and `dt.sum()` to compare the sums between Colab and the local notebook.
colab.txt
local.txt
I posted it on the PyTorch forum but got no answer: https://discuss.pytorch.org/t/numerical-instability-in-google-colab/163610
I am planning to post it on the Colab GitHub issues.
I am getting exactly the same maxdiff for hpreact, and my notebook is running on a local machine.
Python 3.9.13 and `torch.__version__` `'1.12.1'`
I've got a strange observation (using the Colab version):
dlogit_maxes = - dnorm_logits.sum(dim=1, keepdim=True)
gives me exact equality
dlogit_maxes = - dnorm_logits.sum(dim=1)
gives approximate equality with a maxdiff ~ 10^-8
In this example, when the shapes of the gradients are not equal, the comparison is made after broadcasting (I guess) and there is a residual difference; otherwise the values match exactly. It might have to do with the accuracy limitations of floating-point operations: the values here are float32, and 10^-8 is close to the precision limit for float32 operations.
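To illustrate the shape point, here is a small sketch; `dnorm_logits` below is a random stand-in with the lecture's (32, 27) shape:

```python
import torch

torch.manual_seed(0)
dnorm_logits = torch.randn(32, 27)  # stand-in gradient: batch of 32, 27 classes

# keepdim=True keeps the reduced dimension: shape (32, 1)
a = -dnorm_logits.sum(dim=1, keepdim=True)
# without keepdim the result is flattened to shape (32,)
b = -dnorm_logits.sum(dim=1)

print(a.shape, b.shape)  # torch.Size([32, 1]) torch.Size([32])

# the values themselves are identical; only the shapes differ
print(torch.all(a == b.unsqueeze(1)).item())  # True

# the trap: comparing (32, 1) against (32,) element-wise broadcasts
# both operands to (32, 32), so any maxdiff is computed over wrong pairs
print((a - b).shape)  # torch.Size([32, 32])
```

This is why a shape check in `cmp` (as in the PR above) catches the problem before the numeric comparison even runs.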
I've made a PR for the cmp function to output comparison of shapes, it could probably be useful: #36
Another thing is that maybe what matters is the order of the arithmetic operations. Apparently addition and multiplication of floats are not associative: https://pytorch.org/docs/stable/notes/numerical_accuracy.html
The docs also say that results may be inconsistent across devices and across software versions.
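A quick way to see the non-associativity; nothing here is specific to the notebook, it is plain floating-point behavior:

```python
import torch

# Plain Python floats already show it:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# The same effect in float32: summing identical numbers in a different
# order can change the last bits of the result, which is enough to turn
# an exact gradient match into a ~1e-8 maxdiff.
torch.manual_seed(0)
x = torch.randn(100_000, dtype=torch.float32)
s1 = x.sum()
s2 = x.flip(0).sum()  # same numbers, reversed summation order
print(s1.item(), s2.item(), (s1 - s2).abs().item())
```

Different PyTorch builds (or devices) may pick different reduction orders internally, which would explain a Colab-vs-local discrepancy even with identical code.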
I had the same difference problem between gradients when running locally, because I was using the GPU to store tensors and perform computations. Once I switched to CPU, differences remained in the later computations because of the ordering of operations. I managed to get exact gradients by running on CPU and reordering the computations to match the lecture.