xytpai/retinanet

loss is nan when I train on COCO

Closed this issue · 0 comments

syt2 commented

Hi, I used your code to train on COCO and the loss becomes nan, like this:

...
total_step:102: epoch:0, step:816/118287, loss:nan, maxMem:5257MB, time:894ms, lr:0.004693
total_step:103: epoch:0, step:824/118287, loss:nan, maxMem:5257MB, time:970ms, lr:0.004707
total_step:104: epoch:0, step:832/118287, loss:nan, maxMem:5257MB, time:993ms, lr:0.004720
total_step:105: epoch:0, step:840/118287, loss:nan, maxMem:5257MB, time:883ms, lr:0.004733
total_step:106: epoch:0, step:848/118287, loss:nan, maxMem:5257MB, time:920ms, lr:0.004747
total_step:107: epoch:0, step:856/118287, loss:nan, maxMem:5257MB, time:974ms, lr:0.004760
total_step:108: epoch:0, step:864/118287, loss:nan, maxMem:5257MB, time:914ms, lr:0.004773
total_step:109: epoch:0, step:872/118287, loss:nan, maxMem:5257MB, time:869ms, lr:0.004787
total_step:110: epoch:0, step:880/118287, loss:nan, maxMem:5257MB, time:876ms, lr:0.004800

...

When I set the learning rate to 0.0001 or lower, the loss stays normal.
So what's the problem?
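
This pattern (loss turning nan at the higher base learning rate but staying finite at 0.0001) usually points to exploding gradients in the first epochs. A minimal sketch of two common guards, assuming a generic PyTorch-style training step (names like train_step and max_grad_norm are hypothetical, not this repo's code): clip the gradient norm and skip the update when the loss is already non-finite.

import math
import torch

def train_step(model, images, targets, optimizer, max_grad_norm=10.0):
    optimizer.zero_grad()
    loss = model(images, targets)       # assumed to return a scalar training loss
    if not math.isfinite(loss.item()):  # skip the update if the loss is already nan/inf
        print('non-finite loss, skipping step')
        return None
    loss.backward()
    # clip gradients so a single bad batch cannot blow up the weights
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()

A linear warmup of the learning rate over the first few thousand iterations is another common remedy for this kind of divergence.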