affinelayer/pix2pix-tensorflow

Frequently getting NaN for losses? Halts the training process

JT1316 opened this issue · 14 comments

During training I am getting NaN for the training losses, sometimes in the first epoch and sometimes much later. Example:

progress epoch 5 step 357 image/sec 10.4 remaining 391m
discrim_loss nan
gen_loss_GAN 1.5034107
gen_loss_L1 nan

The training process looks to be working perfectly until this point, and then it halts. Any ideas?

Thank you

(0) Invalid argument: Nan in summary histogram for: generator/encoder_5/conv2d/kernel/values
[[node generator/encoder_5/conv2d/kernel/values (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
[[batch/_779]]
(1) Invalid argument: Nan in summary histogram for: generator/encoder_5/conv2d/kernel/values
[[node generator/encoder_5/conv2d/kernel/values (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
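
One way to narrow this down (a minimal, self-contained sketch, not pix2pix.py itself) is to run tf.add_check_numerics_ops() alongside your fetches in graph-mode TF 1.x, so the first NaN/Inf is reported at the op that produced it rather than surfacing later in a summary histogram:

# Minimal sketch: tf.add_check_numerics_ops() adds a CheckNumerics op for every
# float tensor in the graph, so the first NaN/Inf is reported at the op that
# produced it instead of at a summary histogram much later.
import numpy as np
import tensorflow as tf  # graph-mode TF 1.x API (1.14/1.15)

x = tf.placeholder(tf.float32, shape=[None], name="x")
y = tf.log(x, name="log_x")           # log of a non-positive value yields -inf/NaN
loss = tf.reduce_mean(y, name="loss")

check_op = tf.add_check_numerics_ops()  # asserts every float tensor is finite

with tf.Session() as sess:
    try:
        # Feeding a non-positive value raises InvalidArgumentError naming "log_x",
        # which is much easier to act on than "Nan in summary histogram for: ..."
        sess.run([loss, check_op],
                 feed_dict={x: np.array([1.0, 0.0, -1.0], dtype=np.float32)})
    except tf.errors.InvalidArgumentError as e:
        print(e.message)

In the real training script the same idea applies: fetch the check op together with the train op and the error will point at the layer that first went non-finite.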

I'm having the same issue and it's driving me crazy.

  • It only happens with CUDA when using my own images; the facades dataset trained successfully on CUDA.
  • It works perfectly fine with the CPU backend.
  • It works perfectly fine with the DirectML backend.

Batch Normalization is the culprit
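
If batch norm really is the source of the NaNs (for example a near-zero variance on a degenerate batch), one thing to try, as a guess rather than a confirmed fix, is a larger epsilon in the normalization. The wrapper below only mirrors the kind of batchnorm helper pix2pix-tensorflow uses; the argument values are illustrative assumptions, not the repo's code verbatim:

# Hedged sketch of a batchnorm wrapper with a larger epsilon; the argument values
# here are assumptions for illustration, not necessarily what pix2pix.py ships with.
import tensorflow as tf  # TF 1.x API

def batchnorm(inputs, epsilon=1e-3):
    # epsilon sits inside sqrt(variance + epsilon), so a larger value keeps the
    # denominator away from zero when a batch has almost no variance
    return tf.layers.batch_normalization(
        inputs,
        axis=3,                  # channels-last (NHWC) feature maps
        epsilon=epsilon,
        momentum=0.1,
        training=True,
        gamma_initializer=tf.random_normal_initializer(1.0, 0.02),
    )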

@antonio0372 did you fix it?

Hi, can you please elaborate on that solution a little bit?

By the way, for me this issue also happened when using the CPU backend.

Antonio, thanks for your quick answer! Torch... oh well, I don't know anything about that, but I don't know anything about TF either, so I guess I'm giving it a shot then :)

Got the same problem on a generator decoder (not encoder):
Nan in summary histogram for: generator/decoder_5/conv2d_transpose/kernel/values

And it started only when I'm using a batch size > 1, and only on my own dataset.
I suspect it happens because of duplicated images in the dataset. @antonio0372, could you also have duplicates in yours?
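
If that theory holds, a quick way to check is to hash every file in the dataset directory and report byte-for-byte duplicates. This is just a sketch; the "train" directory and the extension list are placeholders for your own layout, not pix2pix defaults:

# Hypothetical duplicate check: report files in a directory that are byte-for-byte identical.
import hashlib
import os
from collections import defaultdict

def find_duplicates(dataset_dir, exts=(".png", ".jpg", ".jpeg")):
    by_hash = defaultdict(list)
    for name in sorted(os.listdir(dataset_dir)):
        if not name.lower().endswith(exts):
            continue
        with open(os.path.join(dataset_dir, name), "rb") as f:
            by_hash[hashlib.md5(f.read()).hexdigest()].append(name)
    return {h: names for h, names in by_hash.items() if len(names) > 1}

if __name__ == "__main__":
    for digest, names in find_duplicates("train").items():
        print(digest, names)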

Thanks for the fast reply, @antonio0372!
Now I have 6 different possible fixes that might resolve this issue; I'll check them all and write down the results here.

Downgrading TensorFlow to 1.14.0 resolves this issue.
It has been working fine for 150 epochs with a batch size of 100.

However, I strongly advise NOT using such a huge batch size, as it generalizes MUCH worse. I tried batch sizes of 4-10 and they give much better results in a comparable amount of time.
I hope somebody finds this useful.

As mentioned by @skabbit, using TensorFlow 1.14.0 (pip install tensorflow-gpu==1.14.0) seems to work fine for now. I am using Anaconda on a Windows 10 machine.
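
A quick sanity check after the downgrade, just to confirm which build the environment actually picks up (nothing pix2pix-specific here):

# Optional sanity check after "pip install tensorflow-gpu==1.14.0"
import tensorflow as tf

print(tf.__version__)              # expect 1.14.0
print(tf.test.is_gpu_available())  # True if the GPU build found CUDA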