Numerical instability in Google Colab - Part 4 of Makemore
sachag678 opened this issue · 8 comments
I ran into an interesting issue in makemore part 4 (backprop ninja) where dhpreact was not exactly matching hpreact.grad.
However, this only happened in the Colab notebook; when I ran the same code in a local Jupyter notebook it matched exactly.
Not sure why this would be the case, but it's an odd curiosity.
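For reference, here is a minimal sketch of the kind of comparison involved. The `cmp` helper and the variable names follow the lecture's style, but the tiny tanh graph below is an illustrative stand-in, not the actual makemore network:

```python
import torch

def cmp(name, dt, t):
    """Compare a manually computed gradient dt against autograd's t.grad."""
    ex = torch.all(dt == t.grad).item()         # exact element-wise match
    app = torch.allclose(dt, t.grad)            # match within float tolerance
    maxdiff = (dt - t.grad).abs().max().item()  # largest element-wise gap
    print(f"{name:10s} | exact: {ex} | approx: {app} | maxdiff: {maxdiff:.2e}")

torch.manual_seed(42)
hpreact = torch.randn(4, 8, requires_grad=True)  # stand-in pre-activations
h = torch.tanh(hpreact)
loss = h.sum()
loss.backward()

dh = torch.ones_like(h)        # dloss/dh for a plain sum
dhpreact = (1.0 - h**2) * dh   # tanh backward, done by hand
cmp("hpreact", dhpreact, hpreact)
```

Whether `exact` comes out `True` or `False` is precisely what can vary between environments; `approx` should be `True` either way.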
oh oh
I'm guessing it has something to do with the Python versions?
Yes, I have the issue with Colab, but not with the local VS Code Jupyter notebook.
The local Jupyter notebook Python version is
Python 3.7.13
The tested colab notebook version is
3.7.14 (default, Sep 8 2022, 00:06:44)
[GCC 7.5.0]
If the difference is small enough, maybe it's fine to accept it with an approximate comparison?
Colab tested notebook: https://colab.research.google.com/drive/1HmZ8bgtAfvyMaZyu3Sr1Bgxsj35jitTs?usp=sharing
Maybe the issue is the PyTorch version?
I used `t.grad.sum()` and `dt.sum()` to compare the sums between Colab and the local notebook.
colab.txt
local.txt
I posted it on the PyTorch forum but got no answer: https://discuss.pytorch.org/t/numerical-instability-in-google-colab/163610
I am planning to post it on the Colab GitHub issues.
I am getting exactly the same maxdiff for hpreact, and my notebook is running on a local machine.
Python 3.9.13 and `torch.__version__` `'1.12.1'`
I've got a strange observation (using the Colab version):
dlogit_maxes = - dnorm_logits.sum(dim=1, keepdim=True)
gives me exact equality
dlogit_maxes = - dnorm_logits.sum(dim=1)
gives approximate equality with a maxdiff ~ 10^-8
In this example, when the shapes of the gradients are not equal, the comparison is made after broadcasting (I guess) and there is a residual difference; otherwise the values match exactly. It might have to do with the accuracy limitations of floating-point operations: the values here are float32, and 10^-8 is close to the precision limit for float32 operations.
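To illustrate the shape point, here is a small sketch; `dnorm_logits` below is a random stand-in with the lecture's (32, 27) shape:

```python
import torch

torch.manual_seed(0)
dnorm_logits = torch.randn(32, 27)  # stand-in gradient: batch of 32, 27 classes

# keepdim=True keeps the reduced dimension: shape (32, 1)
a = -dnorm_logits.sum(dim=1, keepdim=True)
# without keepdim the result is flattened to shape (32,)
b = -dnorm_logits.sum(dim=1)

print(a.shape, b.shape)  # torch.Size([32, 1]) torch.Size([32])

# the values themselves are identical; only the shapes differ
print(torch.all(a == b.unsqueeze(1)).item())  # True

# the trap: comparing (32, 1) against (32,) element-wise broadcasts
# both operands to (32, 32), so any maxdiff is computed over wrong pairs
print((a - b).shape)  # torch.Size([32, 32])
```

This is why a shape check in `cmp` (as in the PR above) catches the problem before the numeric comparison even runs.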
I've made a PR for the cmp function to output comparison of shapes, it could probably be useful: #36
Another thing is that maybe what matters is the order of the arithmetic operations. Apparently addition and multiplication of floats are not associative: https://pytorch.org/docs/stable/notes/numerical_accuracy.html
The docs also say that results may be inconsistent across devices and across software versions.
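A quick way to see the non-associativity; nothing here is specific to the notebook, it is plain floating-point behavior:

```python
import torch

# Plain Python floats already show it:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# The same effect in float32: summing identical numbers in a different
# order can change the last bits of the result, which is enough to turn
# an exact gradient match into a ~1e-8 maxdiff.
torch.manual_seed(0)
x = torch.randn(100_000, dtype=torch.float32)
s1 = x.sum()
s2 = x.flip(0).sum()  # same numbers, reversed summation order
print(s1.item(), s2.item(), (s1 - s2).abs().item())
```

Different PyTorch builds (or devices) may pick different reduction orders internally, which would explain a Colab-vs-local discrepancy even with identical code.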
I had the same difference problem between gradients when running locally, because I was using the GPU to store tensors and perform computations. Once I switched to CPU, differences remained in the later computations because of the ordering of operations. I managed to get exact gradients by running on CPU and reordering the computations to match the lecture.