isl-org/PhotorealismEnhancement

Training Error

EyalMichaeli opened this issue · 16 comments

First, thank you for the great work, really inspiring!

To the point:
I'm trying to use EPE on my own data (CARLA as the source/fake domain, a set of real images as the real domain).
I created fake_gbuffers, created patches, matched them, and everything works correctly.

For some reason, at an iteration a little above 5000, the function clip_gradient_norm throws an error/warning, and from that point on the reconstructed images are black and all outputs are 0/NaN.
I checked, and clip_gradient_norm produces a NaN value, hence the error.

Looking at the tensor itself, it seems that most values (weights) are indeed very close to 0.
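
Roughly, the kind of check involved (a simplified sketch, not EPE's actual clip_gradient_norm code):

```python
import math
import torch

def gradient_norm_is_finite(model: torch.nn.Module, max_norm: float = 1.0) -> bool:
    # Clip the gradients and check whether the resulting total norm is still finite.
    total_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm))
    if not math.isfinite(total_norm):
        print(f"non-finite gradient norm: {total_norm}")
    return math.isfinite(total_norm)
```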

My question is: what do you think could cause this?
A few notes that might be relevant:

  1. The source domain is RGB, the target is grayscale (I don't see why that would be a problem, actually).
  2. Currently, just as a test, I have 100 images from each domain. In general, I have a total of 100k images per domain, so that shouldn't be a problem.

Thanks.

Forgot to update here. In case someone is struggling with this: I solved it eventually.

The reason for the NaNs was lpips.

Instructions to solve:

  1. Set torch.autograd.set_detect_anomaly(True) at the beginning of EPEExperiment.py so PyTorch tells you where the NaNs come from (I saw they came from lpips), then
  2. It's usually in ops like sqrt, pow, etc. Simply add an epsilon to each one and re-run; I used epsilon=1e-08 (see the sketch after this list).
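
A minimal sketch of both steps (the guarded function below is just an illustrative example, not the exact spot inside lpips):

```python
import torch

# Step 1: enable anomaly detection once, before training starts
# (e.g. at the top of EPEExperiment.py). The backward pass that produces
# a NaN will then raise an error pointing at the offending op.
torch.autograd.set_detect_anomaly(True)

# Step 2: guard whatever op the anomaly report points at. Illustrative
# example: sqrt has an infinite gradient at 0, so a tiny eps keeps it finite.
eps = 1e-8

def safe_norm(x: torch.Tensor) -> torch.Tensor:
    # L2 norm over the channel dimension with an epsilon inside the sqrt,
    # so the gradient stays finite even when x is exactly zero.
    return torch.sqrt(torch.sum(x ** 2, dim=1, keepdim=True) + eps)
```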

Closed.

@EyalMichaeli Thank you for coming back to update about how you fixed the NaN problem. I am facing the same issue.
My losses and gradient norms become NaN, so I added eps=1e-08 to the tensor normalization in the LPIPS.forward() function of the lpips library. I even added 'eps' to discriminator_losses.py in /../code/epe/network.
The NaN problem still persists. I was wondering if you could shed some light on where exactly you added eps and whether you did anything else to fix this problem.
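
Roughly, the change I made looks like this (simplified; the exact normalization helper in the installed lpips version may differ):

```python
import torch

def normalize_tensor(in_feat, eps=1e-8):
    # Unit-normalize feature maps along the channel dimension; the eps keeps both
    # the division and the gradient of the sqrt finite when a feature map is all zeros.
    norm_factor = torch.sqrt(torch.sum(in_feat ** 2, dim=1, keepdim=True) + eps)
    return in_feat / (norm_factor + eps)
```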

@nmaanvi did you set torch.autograd.set_detect_anomaly(True) at the beginning of EPEExperiment.py, as stated in step 1?
This should make PyTorch report exactly which computation causes the problem (normally it's a specific sqrt or another math function).
So you should simply follow that debugging methodology: run, add eps, run, add eps to a different function... until there are no NaNs.
This took me a few training cycles.

Tip: increase the LR to reach the problematic phase faster (not too much, though).

Thank you for the prompt reply. Yes, I did set set_detect_anomaly(True) at the beginning. DivBackward0 returns NaN, and the error is thrown in loss.backward() going through _run_generator(); the corresponding loss is the LPIPS loss.
Thanks for the tip to increase the LR!

Hey @nmaanvi, I'm running into similar issues with NaN. Did you manage to fix it?
I'm adding epsilons as you did here, but it's not working for me either.

Hello @KacperKazan, I was able to train the model for some more iterations by reducing the learning rates of both the generator and the discriminator. However, the NaN problem comes up again after 100K iterations.

In EPEExperiment.py, the original code uses "loss = 0" to initialize the loss value, so my idea is just to change this to "loss = 1e-08". Do you guys think that's OK or not?
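
Roughly what I mean (a toy sketch with made-up loss terms, not the actual accumulation code in EPEExperiment.py):

```python
import torch

# made-up stand-ins for the individual loss terms
lpips_term = torch.tensor(0.5, requires_grad=True)
gan_term = torch.tensor(0.3, requires_grad=True)

loss = 1e-8                          # instead of the original: loss = 0
loss = loss + lpips_term + gan_term  # the constant offset does not change the gradients
loss.backward()
```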

Hey @nmaanvi, hmm, I really don't understand the NaN problem :( Maybe we could help each other out in solving this issue.

Is your training able to produce any good results? Also, which generator network do you use, 'hr' or 'hr_new'?

For example, when I use 'hr', training crashes after about 5000 iterations with a NaN error in the loss backward. However, when I use 'hr_new', it doesn't crash, but most of the time it just outputs black images, and recently it started outputting noise like what can be seen below.
[screenshots: noisy generator outputs]

Also, @EyalMichaeli, were you able to resolve this issue fully? Just adding epsilons to functions doesn't seem to work for me.

Regarding initializing the loss with 1e-08: I guess it wouldn't hurt to try it. Does it actually solve the NaN issue?

@KacperKazan, I used 'hr', and with a reduced learning rate I could train the model (with pretty good results) until 100K iterations, after which the same NaN problem crops up. When I use 'hr_new', NaNs come up after 5K iterations. I am currently trying to solve this NaN problem without changing the LR, and also to train the model as per the authors' suggestion (1M iterations).

Hey guys, please try the lpips version I used (I forked it here: https://github.com/EyalMichaeli/PerceptualSimilarity).
Simply install it locally (editable, so you can change it if you like) and try running with it; perhaps it'll work. If it doesn't, let me know and I'll try to figure out what else I changed.

@KacperKazan Regarding your question: yes, training has been running smoothly since I posted the comment here.
Side note: I'm using a different set of datasets, and I think each dataset might trigger NaNs in different functions, so make sure you add eps wherever PyTorch anomaly detection points you.

@KacperKazan, I tried working with 'hr_new' again and see that the generator (PassThruGen) output has NaNs, which results in NaNs everywhere else. After seeing your question here (#45), I just added the same sigmoid function used in 'hr', and the model at least no longer gives black images and is training (albeit poorly). How have you handled the unbounded output from the generator?
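
Roughly what I did (an illustrative wrapper, not the repo's actual PassThruGen code):

```python
import torch
import torch.nn as nn

class BoundedGenerator(nn.Module):
    # Squashes an otherwise unbounded generator output into (0, 1) with a
    # sigmoid, the same kind of bounding 'hr' applies to its output.
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x):
        out = self.backbone(x)     # unbounded image prediction
        return torch.sigmoid(out)  # bounded output, so downstream losses stay finite
```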

@EyalMichaeli I will try your lpips and give you feedback later.

No luck, it crashed at 3725 iterations; everything else was left at its default.

I just set "spectral = False" in function "make_conv_layer" in network_factory.py, and the training process goes well so far.

It works! My solution is ugly (dropping the spectral norm), and I'll try with real G-buffers to see if I can get a good result.