SuperMHP/GUPNet

Why does the loss become nan?

Closed this issue · 7 comments

Cc-Hy commented

Hello, during my training the loss always becomes nan. Does anyone know why?

Cc-Hy commented

Hello!
Thanks for answering, but when I use a batch size of 48 on 2 GPUs, the loss still becomes nan after 55 epochs.
I think the probability of all 48 samples in a batch being empty is really small.
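For context, one way an "empty batch" can produce nan (a minimal sketch, not the repository's actual loss code): if a loss term is averaged over the number of foreground objects and a batch happens to contain none, the division becomes 0/0. Clamping the denominator avoids this.

```python
# Minimal sketch of a masked loss that goes nan on an all-empty batch,
# and the clamp that prevents it. Not the repository's implementation.
import torch
import torch.nn.functional as F

def masked_l1_loss(pred, target, mask):
    # mask: 1 for positions with an object, 0 for background.
    num_pos = mask.sum()
    loss = (F.l1_loss(pred, target, reduction='none') * mask).sum()
    # Without the clamp, num_pos == 0 makes loss / num_pos = 0/0 = nan.
    return loss / torch.clamp(num_pos, min=1.0)

# Example: a batch where every sample is empty.
pred = torch.randn(48, 10)
target = torch.randn(48, 10)
mask = torch.zeros(48, 10)                 # no objects anywhere in the batch
print(masked_l1_loss(pred, target, mask))  # finite 0.0 instead of nan
```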
And there is another strange thing:
The seg_loss is sometimes nan and sometimes not.
And it seems that seg_loss has nothing to do with the object mask, so it should not become nan in any case.
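One way to narrow this down (a small debugging sketch, not GUPNet's code) is to check every individual loss term for nan/inf before calling backward, so the first offending term is identified instead of only seeing the total loss turn nan:

```python
# Debugging sketch: report which loss term stops being finite, and at which step.
import torch

def check_loss_terms(loss_dict, step):
    # Raise as soon as any individual loss term is nan or inf.
    for name, value in loss_dict.items():
        if not torch.isfinite(value).all():
            raise RuntimeError(f"step {step}: loss term '{name}' is not finite: {value}")

# Hypothetical example: seg_loss is fine, depth_loss has gone nan.
losses = {"seg_loss": torch.tensor(0.7), "depth_loss": torch.tensor(float("nan"))}
try:
    check_loss_terms(losses, step=100)
except RuntimeError as e:
    print(e)  # step 100: loss term 'depth_loss' is not finite: tensor(nan)
```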

Cc-Hy commented

Hello, I fixed the nan bug.
But I found a new strange thing: when I use the well-trained checkpoint to run inference on the train set,
the AP is extremely low.
This is very strange. Is there a bug in the target building or the output decoding?
I tried to find out why, but failed.
Any ideas?
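For what it's worth, one generic sanity check for this kind of symptom (the parameterization and functions below are simplified stand-ins, not the repository's actual target building or output decoding) is to encode the ground-truth labels into training targets and decode them straight back; if the round trip does not reproduce the labels, train-set AP will be low even for a model that fits the targets well:

```python
# Illustrative round-trip check with a toy parameterization:
# feature-map center offset + log size + depth.
import numpy as np

def encode(center, size, depth, downsample=4):
    # Build a training target from a ground-truth box (simplified stand-in).
    return np.concatenate([center / downsample, np.log(size), [depth]])

def decode(target, downsample=4):
    # Recover the box from the target (must exactly invert encode).
    center = target[:2] * downsample
    size = np.exp(target[2:4])
    depth = target[4]
    return center, size, depth

gt_center, gt_size, gt_depth = np.array([300.0, 180.0]), np.array([1.6, 3.9]), 22.5
center, size, depth = decode(encode(gt_center, gt_size, gt_depth))
assert np.allclose(center, gt_center) and np.allclose(size, gt_size) and np.isclose(depth, gt_depth)
print("target encoding and output decoding are consistent")
```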


Hi, I have the same problem: the loss always becomes nan. How did you solve it?

Cc-Hy commented

My problem was caused by an incorrect function I wrote myself.
If you are using the original code without modification, I think your problem is probably the batch size.

Hi, I was busy with another deadline before. There is an easy way for you to check: you can download my released checkpoint (trained on the train set) and run inference on both the eval set and the train set.