NaN error while training on COCO/Pascal VOC/custom dataset after a random number of steps
Closed this issue · 3 comments
I tried training Luminoth on the standard COCO and Pascal VOC datasets, and on my own dataset converted to COCO format (all were eventually converted to the tfrecords format using the transform tool). After running for an hour or two, training terminated each time with the same error. Error trace attached.
I am issuing a simple lumi train -c config.xml command. I have also attached a zip of the dataset directory and the config file for reference.
Looks like the loss is exploding. Try to lower your learning rate.
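To see why a too-high learning rate produces exactly this failure mode, here is a toy sketch (not Luminoth code): plain gradient descent on f(x) = x**2. The update is x ← x·(1 − 2·lr), which converges when |1 − 2·lr| < 1 and blows up to inf/NaN otherwise.

```python
import math

def minimize(lr, steps=2000, x0=1.0):
    """Gradient descent on f(x) = x**2; gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # update: x <- x * (1 - 2*lr)
        if not math.isfinite(x):
            return x     # overflowed: the "loss exploding" failure mode
    return x

print(minimize(lr=0.1))  # contracts toward 0 each step
print(minimize(lr=1.5))  # |1 - 2*lr| = 2, magnitude doubles until overflow
```

The same mechanism applies to a real network: if the effective step size is too large for the local curvature, the loss grows step over step until it overflows to inf and then NaN.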
What kind of images do you have? Your tfrecords file is only about 77 KB. Are your images tiny? How many are there? Can you post examples?
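As a quick sanity check on a suspiciously small file, you can count the records in a .tfrecords file without loading TensorFlow by walking the TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). A minimal sketch; it skips CRC verification, and the path is whatever your transform tool produced (with TensorFlow available, tf.data.TFRecordDataset is the proper way to read the file):

```python
import struct

def count_tfrecords(path):
    """Count records in a TFRecord file by walking its framing.

    Each record: 8-byte LE length, 4-byte length CRC,
    `length` payload bytes, 4-byte payload CRC.
    CRCs are skipped, not verified.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # end of file (or truncated record)
            (length,) = struct.unpack("<Q", header)
            f.seek(4 + length + 4, 1)  # skip length CRC, payload, payload CRC
            count += 1
    return count
```

If the count is far lower than the number of images you expected, the transform step likely dropped examples.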
You are right, the error trace suggests that the loss is exploding:
INFO:tensorflow:step: 1622, file: b'TrainImg01.png', train_loss: 676.6780395507812, in 2.88s
INFO:tensorflow:step: 1623, file: b'TrainImg02.png', train_loss: 388.91082763671875, in 2.97s
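Rather than letting a run burn hours before dying on a NaN, one option is to wrap the loop with a guard that aborts as soon as the loss goes non-finite or crosses a threshold. A generic sketch, not the Luminoth API; train_step here is a hypothetical callable returning the step's loss:

```python
import math

def run_training(train_step, max_steps, loss_threshold=1e4):
    """Call train_step up to max_steps times, aborting early if the
    loss is non-finite (NaN/inf) or exceeds loss_threshold."""
    loss = None
    for step in range(max_steps):
        loss = train_step()
        if not math.isfinite(loss) or loss > loss_threshold:
            print(f"Aborting at step {step}: loss={loss}")
            return step, loss
    return max_steps, loss
```

With losses already in the hundreds, as in the log above, a guard like this would have stopped the run well before the NaN appeared.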
I am currently trying to move to faster GPU-based hardware to reduce training time, and then I will probably try your suggestion of lowering the learning rate.
My images are simply a bunch of different web page screenshots; I have attached one for reference. They are mostly white background with a handful of UI controls, which I am guessing is why the tfrecord file is so small.
On a side note, I was able to complete a couple of training runs yesterday without the error, so I guess the issue is intermittent at best.
This error did not materialize after I moved to faster GPU-based hardware, so I am closing this issue.