The OCR loss becomes NaN
GlowingHorse opened this issue · 2 comments
I am using my own Japanese dataset and have cropped all words into single-word images.
Then I used train_ocr to train the OCR network, with e2e-mltrctw.h5 as the pretrained model but with the model's output size changed from 7500 to 4748, the number of character classes in my dataset. However, the loss goes to NaN very quickly. Is there a reason for this? Thanks!
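For reference, here is a minimal PyTorch sketch of how one might resize the classification head while keeping the rest of the pretrained weights. The `OCRModel` class, the `conv_cls` layer name, and the assumption that the checkpoint is a `torch.load`-compatible state dict are all placeholders; this repo's own model and .h5 loading helper may work differently.

```python
import torch
import torch.nn as nn

# Toy stand-in for the real OCR network -- only the final classification
# layer matters here. The actual model and layer names in this repo differ.
class OCRModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Conv2d(3, 256, 3, padding=1)
        self.conv_cls = nn.Conv2d(256, num_classes, 1)  # per-class logits

    def forward(self, x):
        return self.conv_cls(self.features(x))

model = OCRModel(num_classes=4748)

# Keep only pretrained tensors whose shapes still match; the resized
# output layer keeps its fresh random initialization.
checkpoint = torch.load('e2e-mltrctw.h5', map_location='cpu')
state = model.state_dict()
state.update({k: v for k, v in checkpoint.items()
              if k in state and v.shape == state[k].shape})
model.load_state_dict(state)
```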
This is the training log:
683464 training images in data/crop_train_images/crop_trainkuzushi.txt
epoch 0[0], loss: 55.214, lr: 0.00010
epoch 0[500], loss: 54.610, lr: 0.00010
epoch 0[1000], loss: 14.609, lr: 0.00010
epoch 0[1500], loss: 7.219, lr: 0.00010
epoch 0[2000], loss: 6.109, lr: 0.00010
epoch 0[2500], loss: 5.536, lr: 0.00010
epoch 0[3000], loss: 4.826, lr: 0.00010
epoch 0[3500], loss: 4.030, lr: 0.00010
epoch 0[4000], loss: 3.301, lr: 0.00010
epoch 0[4500], loss: nan, lr: 0.00010
epoch 1[5000], loss: nan, lr: 0.00010
save model: backup2/E2E_5000.h5
epoch 1[5500], loss: nan, lr: 0.00010
epoch 1[6000], loss: nan, lr: 0.00010
epoch 1[6500], loss: nan, lr: 0.00010
epoch 1[7000], loss: nan, lr: 0.00010
epoch 1[7500], loss: nan, lr: 0.00010
epoch 1[8000], loss: nan, lr: 0.00010
epoch 1[8500], loss: nan, lr: 0.00010
epoch 1[9000], loss: nan, lr: 0.00010
epoch 1[9500], loss: nan, lr: 0.00010
That's weird. When I increased the network output size from 4748 to 4900, the network could be trained for longer. So far, NaN has not appeared. I will report back tomorrow.
It seems to have been solved. I just set the number of output channels to be larger than the number of target classes (don't just add one; try going about one hundred above the target number).
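A likely explanation, offered as an assumption rather than a confirmed diagnosis: with a CTC-style loss, the output layer needs one channel per character class plus one for the blank symbol, so if the labels are encoded with indices up to the class count while the layer has exactly that many channels, some indices fall out of range and the loss can turn into NaN. Padding the output layer with extra channels hides such off-by-N mismatches. A minimal sanity check, where `check_labels` and the example data are purely illustrative:

```python
# Illustrative check (not this repo's API): make sure every encoded label
# index fits inside the output layer, leaving room for the CTC blank.
def check_labels(encoded_labels, num_outputs):
    max_idx = max(max(seq) for seq in encoded_labels if seq)
    assert max_idx < num_outputs, (
        f'label index {max_idx} out of range for {num_outputs} outputs')

# 4748 distinct characters + 1 CTC blank need at least 4749 outputs.
check_labels([[1, 5, 4747], [12, 4748]], num_outputs=4749)
```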