herobd/handwriting_line_generation

I got 'nan' for all of the losses when I trained the encoder with a custom dataset.

aii-masao-taketani opened this issue · 4 comments

Hi,
Thank you for sharing this great work! I really appreciate it.

I have a question about training the encoder.
I got 'nan' for the losses while training the encoder on a custom dataset, although training went fine with the IAM dataset. Do you have any idea why the losses become 'nan' with a custom dataset?
One possibility I can think of: images in the IAM dataset have no white-space margins before and after a sentence, whereas my dataset does include such margins.
Do you think that could be the reason for the 'nan' losses? Or do you have any other ideas?

I will be grateful for any help you can provide.

Ugh, I'm really not sure. Can you pinpoint which loss is producing the NaN?
You'll need to look in trainer/hw_with_style_trainer.py in the run_gen() function. Just add a print or something for all the elements in losses at the end of the function.

e.g. at line 1878 add:

for loss_name, loss in losses.items():
    print('{} : {}'.format(loss_name, loss))
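To go one step further than printing, the same loop can flag which loss went bad. This is a minimal sketch in plain Python, assuming a `losses` dict like the one the trainers accumulate (the keys and values here are hypothetical stand-ins, not taken from the repo):

```python
import math

# Hypothetical loss dict standing in for the `losses` accumulated at the
# end of the trainer's iteration; the real keys depend on the config.
losses = {'autoLoss': 0.3776, 'recogLoss': float('nan')}

for loss_name, loss in losses.items():
    marker = '  <-- NaN' if math.isnan(loss) else ''
    print('{} : {}{}'.format(loss_name, loss, marker))

# Collect the offending loss names so the run can be halted for inspection.
nan_losses = [name for name, value in losses.items() if math.isnan(value)]
```

With real tensors one would use `torch.isnan(loss).any()` instead of `math.isnan`, but the idea is the same.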

Thank you for your reply.
As for training the encoder, aren't we using trainer/auto_trainer.py rather than trainer/hw_with_style_trainer.py? The config says "class": "AutoTrainer", so I assume that's what you meant.
In any case, I printed out the losses in auto_trainer.py as you suggested, and got the following results.

autoLoss : 0.6755558252334595
recogLoss : 19.618896484375
autoLoss : 0.33510729670524597
recogLoss : 6.563272953033447
autoLoss : 0.15641318261623383
recogLoss : 5.291152000427246
autoLoss : 0.09177795052528381
recogLoss : 4.939859867095947
autoLoss : 0.09521833807229996
recogLoss : 4.030978202819824
autoLoss : 0.09136442840099335
recogLoss : 3.9619622230529785
autoLoss : 0.08736331760883331
recogLoss : 3.702374219894409
autoLoss : 0.46055248379707336
recogLoss : 5.220013618469238
autoLoss : 0.6196017265319824
recogLoss : 3.7991933822631836
autoLoss : 0.6192638874053955
recogLoss : 3.773500680923462
autoLoss : 0.5980427861213684
recogLoss : 3.735132932662964
autoLoss : 0.5945518612861633
recogLoss : 3.6780688762664795
autoLoss : 0.5974769592285156
recogLoss : 3.6752817630767822
autoLoss : 0.5870442986488342
recogLoss : 3.5798778533935547
autoLoss : 0.37762194871902466
recogLoss : 0.0
autoLoss : nan
recogLoss : nan
(the same nan pair repeats for every subsequent iteration)

For the first few iterations the losses look reasonable, but then they all suddenly become nan. The curious thing is that the nan values always appear right after recogLoss hits 0.0. I tried three times, and it happened every time.
Does anything come to mind when you look at this log?

Sorry, I had misunderstood your problem. Unless the images in your new dataset are very different from the IAM dataset, you should be fine using the pre-trained encoder. Its purpose is purely to provide a meaningful perceptual loss.

It's interesting that the recogLoss is 0.0 right before everything goes NaN. That would require a perfect prediction, which I think would only happen on a zero-length target (empty image? empty label?). Look at the data going through the model and the CTC loss when it hits 0.0.
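As an aside on why the log never recovers: this tiny plain-Python sketch (not the trainer code) shows that once any parameter is NaN, everything computed from it is NaN too, which is why every loss after the first bad update reads nan.

```python
import math

# Illustration of NaN propagation: one poisoned parameter taints every
# downstream value, so all later losses in the log stay nan.
weight = float('nan')            # a parameter after a bad gradient step
activation = weight * 0.5 + 1.0  # any arithmetic with NaN yields NaN
loss = activation ** 2           # ...and so does the loss built from it
```

So the 0.0 batch is the event to debug; everything after it is just the poisoned weights.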

Thank you for answering my message.
I checked the labels and image tensors at the iteration where recogLoss becomes 0.0. There seems to be no problem with the images or labels, so I have no idea why it becomes 0.0.
Anyway, I modified the code so that the model does not update its weights when recogLoss is 0.0, and training now seems to go well.
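The workaround described above can be sketched as a small guard. This is a hypothetical helper (the name and threshold are mine, not from the repo), assuming the loss has already been reduced to a Python float:

```python
import math

def should_skip_update(recog_loss, eps=1e-8):
    """Return True for degenerate batches that should not update the weights.

    An exactly-zero CTC loss is suspicious (usually an empty target), and a
    non-finite loss would poison the weights, so both skip the optimizer step.
    """
    return abs(recog_loss) < eps or not math.isfinite(recog_loss)
```

In the trainer this check would wrap the backward/optimizer-step call, so a single bad batch is dropped instead of corrupting the model.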

Thank you very much for your advice once again.
I will close the issue now.