affinelayer/pix2pix-tensorflow

Frequently getting NaN for losses? Halts the training process

JT1316 opened this issue · 14 comments

During training I am getting NaN for the training losses, sometimes in the first epoch and sometimes much later. Example:

progress epoch 5 step 357 image/sec 10.4 remaining 391m
discrim_loss nan
gen_loss_GAN 1.5034107
gen_loss_L1 nan

The training process looks to be working perfectly until this point, and then it halts. Any ideas?

Thank you

(0) Invalid argument: Nan in summary histogram for: generator/encoder_5/conv2d/kernel/values
[[node generator/encoder_5/conv2d/kernel/values (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
[[batch/_779]]
(1) Invalid argument: Nan in summary histogram for: generator/encoder_5/conv2d/kernel/values
[[node generator/encoder_5/conv2d/kernel/values (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
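
One way to narrow this down (a minimal, self-contained sketch, not pix2pix.py itself) is to run tf.add_check_numerics_ops() alongside your fetches in graph-mode TF 1.x, so the first NaN/Inf is reported at the op that produced it rather than surfacing later in a summary histogram:

# Minimal sketch: tf.add_check_numerics_ops() adds a CheckNumerics op for every
# float tensor in the graph, so the first NaN/Inf is reported at the op that
# produced it instead of at a summary histogram much later.
import numpy as np
import tensorflow as tf  # graph-mode TF 1.x API (1.14/1.15)

x = tf.placeholder(tf.float32, shape=[None], name="x")
y = tf.log(x, name="log_x")           # log of a non-positive value yields -inf/NaN
loss = tf.reduce_mean(y, name="loss")

check_op = tf.add_check_numerics_ops()  # asserts every float tensor is finite

with tf.Session() as sess:
    try:
        # Feeding a non-positive value raises InvalidArgumentError naming "log_x",
        # which is much easier to act on than "Nan in summary histogram for: ..."
        sess.run([loss, check_op],
                 feed_dict={x: np.array([1.0, 0.0, -1.0], dtype=np.float32)})
    except tf.errors.InvalidArgumentError as e:
        print(e.message)

In the real training script the same idea applies: fetch the check op together with the train op and the error will point at the layer that first went non-finite.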

I'm having the same issue and it's driving me crazy.

  • It only happens with CUDA when using my own images; the facades dataset trained successfully on CUDA.
  • It works perfectly fine with the CPU backend.
  • It works perfectly fine with the DirectML backend.

Batch Normalization is the culprit
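
If batch norm really is the source of the NaNs (for example a near-zero variance on a degenerate batch), one thing to try, as a guess rather than a confirmed fix, is a larger epsilon in the normalization. The wrapper below only mirrors the kind of batchnorm helper pix2pix-tensorflow uses; the argument values are illustrative assumptions, not the repo's code verbatim:

# Hedged sketch of a batchnorm wrapper with a larger epsilon; the argument values
# here are assumptions for illustration, not necessarily what pix2pix.py ships with.
import tensorflow as tf  # TF 1.x API

def batchnorm(inputs, epsilon=1e-3):
    # epsilon sits inside sqrt(variance + epsilon), so a larger value keeps the
    # denominator away from zero when a batch has almost no variance
    return tf.layers.batch_normalization(
        inputs,
        axis=3,                  # channels-last (NHWC) feature maps
        epsilon=epsilon,
        momentum=0.1,
        training=True,
        gamma_initializer=tf.random_normal_initializer(1.0, 0.02),
    )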

@antonio0372 did you fix it?

Hi, can you please elaborate on that solution a little bit?

By the way, for me this issue also happened when using the CPU backend.

Antonio, thanks for your quick answer! Torch... oh well, I don't know anything about that, but I don't know anything about TF either, so I guess I'm giving it a shot then :)

Got the same problem on a generator decoder (not encoder):
Nan in summary histogram for: generator/decoder_5/conv2d_transpose/kernel/values

And it started only when I'm using a batch size > 1, and only on my own dataset.
I suspect it happens because of duplicated images in the dataset. @antonio0372, could you also have duplicates in yours?
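
If that theory holds, a quick way to check is to hash every file in the dataset directory and report byte-for-byte duplicates. This is just a sketch; the "train" directory and the extension list are placeholders for your own layout, not pix2pix defaults:

# Hypothetical duplicate check: report files in a directory that are byte-for-byte identical.
import hashlib
import os
from collections import defaultdict

def find_duplicates(dataset_dir, exts=(".png", ".jpg", ".jpeg")):
    by_hash = defaultdict(list)
    for name in sorted(os.listdir(dataset_dir)):
        if not name.lower().endswith(exts):
            continue
        with open(os.path.join(dataset_dir, name), "rb") as f:
            by_hash[hashlib.md5(f.read()).hexdigest()].append(name)
    return {h: names for h, names in by_hash.items() if len(names) > 1}

if __name__ == "__main__":
    for digest, names in find_duplicates("train").items():
        print(digest, names)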

Thanks for the fast reply, @antonio0372!
Now I have 6 different possible fixes that might resolve this issue; I'll check them all and write down the results here.

Downgrading TensorFlow to 1.14.0 resolves this issue.
It has been working fine for 150 epochs with a batch size of 100.

However, I strongly advise NOT using such a huge batch size, as it generalizes MUCH worse. I tried batch sizes of 4-10 and they give much better results in a comparable amount of time.
I hope somebody finds this useful.

As mentioned by @skabbit, using TensorFlow 1.14.0 (pip install tensorflow-gpu==1.14.0) seems to work fine for now. I am using Anaconda on a Windows 10 machine.
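
A quick sanity check after the downgrade, just to confirm which build the environment actually picks up (nothing pix2pix-specific here):

# Optional sanity check after "pip install tensorflow-gpu==1.14.0"
import tensorflow as tf

print(tf.__version__)              # expect 1.14.0
print(tf.test.is_gpu_available())  # True if the GPU build found CUDA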