SuperMHP/GUPNet

Why does the loss become nan?

Closed this issue · 7 comments

Cc-Hy commented

Hello, during my training the loss always becomes nan. Does anyone know why?

Cc-Hy commented

Hello!
Thanks for answering, but when I use a batch size of 48 on 2 GPUs, the loss still becomes nan after 55 epochs.
I think the probability of all 48 samples in a batch being empty is really small.
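For context, one way an "empty batch" can produce nan (a minimal sketch, not the repository's actual loss code): if a loss term is averaged over the number of foreground objects and a batch happens to contain none, the division becomes 0/0. Clamping the denominator avoids this.

```python
# Minimal sketch of a masked loss that goes nan on an all-empty batch,
# and the clamp that prevents it. Not the repository's implementation.
import torch
import torch.nn.functional as F

def masked_l1_loss(pred, target, mask):
    # mask: 1 for positions with an object, 0 for background.
    num_pos = mask.sum()
    loss = (F.l1_loss(pred, target, reduction='none') * mask).sum()
    # Without the clamp, num_pos == 0 makes loss / num_pos = 0/0 = nan.
    return loss / torch.clamp(num_pos, min=1.0)

# Example: a batch where every sample is empty.
pred = torch.randn(48, 10)
target = torch.randn(48, 10)
mask = torch.zeros(48, 10)                 # no objects anywhere in the batch
print(masked_l1_loss(pred, target, mask))  # finite 0.0 instead of nan
```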
And there is another strange thing:
The seg_loss is sometimes nan and sometimes not.
And it seems that seg_loss has nothing to do with the object mask, so it should not become nan in any case.
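One way to narrow this down (a small debugging sketch, not GUPNet's code) is to check every individual loss term for nan/inf before calling backward, so the first offending term is identified instead of only seeing the total loss turn nan:

```python
# Debugging sketch: report which loss term stops being finite, and at which step.
import torch

def check_loss_terms(loss_dict, step):
    # Raise as soon as any individual loss term is nan or inf.
    for name, value in loss_dict.items():
        if not torch.isfinite(value).all():
            raise RuntimeError(f"step {step}: loss term '{name}' is not finite: {value}")

# Hypothetical example: seg_loss is fine, depth_loss has gone nan.
losses = {"seg_loss": torch.tensor(0.7), "depth_loss": torch.tensor(float("nan"))}
try:
    check_loss_terms(losses, step=100)
except RuntimeError as e:
    print(e)  # step 100: loss term 'depth_loss' is not finite: tensor(nan)
```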

Cc-Hy commented

Hello, I fixed the nan bug.
But I found a new strange thing: when I use the well-trained checkpoint to run inference on the train set,
the AP is extremely low.
This is very strange. Is there a bug in the target building or the output decoding?
I tried to find out why, but failed.
Any ideas?
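For what it's worth, one generic sanity check for this kind of symptom (the parameterization and functions below are simplified stand-ins, not the repository's actual target building or output decoding) is to encode the ground-truth labels into training targets and decode them straight back; if the round trip does not reproduce the labels, train-set AP will be low even for a model that fits the targets well:

```python
# Illustrative round-trip check with a toy parameterization:
# feature-map center offset + log size + depth.
import numpy as np

def encode(center, size, depth, downsample=4):
    # Build a training target from a ground-truth box (simplified stand-in).
    return np.concatenate([center / downsample, np.log(size), [depth]])

def decode(target, downsample=4):
    # Recover the box from the target (must exactly invert encode).
    center = target[:2] * downsample
    size = np.exp(target[2:4])
    depth = target[4]
    return center, size, depth

gt_center, gt_size, gt_depth = np.array([300.0, 180.0]), np.array([1.6, 3.9]), 22.5
center, size, depth = decode(encode(gt_center, gt_size, gt_depth))
assert np.allclose(center, gt_center) and np.allclose(size, gt_size) and np.isclose(depth, gt_depth)
print("target encoding and output decoding are consistent")
```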


Hi, I have the same problem: the loss always becomes nan. How did you solve it?

Cc-Hy commented

My problem was caused by an incorrect function I wrote myself.
If you are using the original code without modification, I think your problem is probably the batch size.

Hi, I was busy with another deadline before. There is an easy way for you to check: you can download my released checkpoint (trained on the train set) and run inference on both the eval set and the train set.