lxtGH/SFSegNets

Unable to train

lucasjinreal opened this issue · 14 comments

I got a NaN loss:

04-08 10:51:43.489 NaN or Inf found in input tensor.
04-08 10:51:43.559 [epoch 28], [iter 8136 / 13576], [train main loss nan], [lr 0.001678]
04-08 10:51:43.560 NaN or Inf found in input tensor.
04-08 10:51:43.632 [epoch 28], [iter 8137 / 13576], [train main loss nan], [lr 0.001678]
04-08 10:51:43.632 NaN or Inf found in input tensor.
04-08 10:51:43.965 [epoch 28], [iter 8138 / 13576], [train main loss nan], [lr 0.001678]
04-08 10:51:43.965 NaN or Inf found in input tensor.
04-08 10:51:44.036 [epoch 28], [iter 8139 / 13576], [train main loss nan], [lr 0.001678]
04-08 10:51:44.037 NaN or Inf found in input tensor.
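
To pin down when the loss first goes non-finite, the log can be scanned for the first NaN/Inf entry. This is a minimal sketch that assumes every progress line follows the exact format shown in the log above; `LINE_RE` and `first_non_finite` are hypothetical names, not part of SFSegNets.

```python
import math
import re

# Pattern for the progress lines in the log above, e.g.
# "[epoch 28], [iter 8136 / 13576], [train main loss nan], [lr 0.001678]"
# (assumption: all progress lines share this layout)
LINE_RE = re.compile(
    r"\[epoch (\d+)\], \[iter (\d+) / (\d+)\], "
    r"\[train main loss ([^\]]+)\], \[lr ([^\]]+)\]"
)


def first_non_finite(lines):
    """Return (epoch, iter, lr) for the first NaN/Inf loss line, else None."""
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines such as "NaN or Inf found in input tensor."
        loss = float(m.group(4))  # float() accepts "nan" and "inf"
        if not math.isfinite(loss):
            return int(m.group(1)), int(m.group(2)), float(m.group(5))
    return None


# Two lines copied from the log above.
sample = [
    "04-08 10:51:43.489 NaN or Inf found in input tensor.",
    "04-08 10:51:43.559 [epoch 28], [iter 8136 / 13576], [train main loss nan], [lr 0.001678]",
]
print(first_non_finite(sample))
```

Running it over the full log file instead of `sample` gives the epoch, iteration, and learning rate at which the loss first became non-finite. That is usually the most useful detail to include in a report like this one.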

lxtGH commented

image
I do not get this error.
I tried it myself, and training looks normal in the first epoch.

@lxtGH The first epoch converges, but from epoch 2 onward the loss gets bigger and bigger:

image

lxtGH commented

What about using PyTorch 1.4?

@lxtGH Dude, to verify your code, I reinstalled the whole environment, including PyTorch 1.4. Same behavior.
You can test it on PyTorch 1.8... Nobody from mmseg can reproduce your results...

lxtGH commented

@jinfagang What is your training setting with PyTorch 1.4? As for mmseg, I will double-check its details.

lxtGH commented

Also, which dataset are you using? Cityscapes or your own custom dataset?

@lxtGH I am training on a custom dataset, following exactly the same steps as for Cityscapes. However, I think the bigger issue is that I also tried Cityscapes and hit the same problem.

lxtGH commented

So, in summary, using PyTorch 1.4 on Cityscapes still has this problem?

@lxtGH Besides, I double-checked with PyTorch 1.7. You can try PyTorch 1.7 yourself; this should not be a PyTorch version issue. Something in the code is going wrong. Try debugging with PyTorch 1.7 to see what causes the issue.
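
One way to narrow this down is to stop at the first non-finite loss instead of logging NaN for the rest of the epoch. This is only a sketch: `NonFiniteLossError` and `check_loss` are hypothetical helpers, not part of SFSegNets or PyTorch, and the loss values are made up to stand in for per-iteration scalars.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Hypothetical exception type, not part of SFSegNets or PyTorch."""


def check_loss(loss_value, epoch, iteration):
    """Raise as soon as the scalar loss stops being finite."""
    if not math.isfinite(loss_value):
        raise NonFiniteLossError(
            f"non-finite loss {loss_value!r} at epoch {epoch}, iter {iteration}"
        )
    return loss_value


# Stand-in for per-iteration scalar losses; in a PyTorch loop this would
# typically be loss.item(), checked before the backward pass.
losses = [0.91, 0.74, float("nan"), 0.70]
try:
    for it, value in enumerate(losses):
        check_loss(value, epoch=0, iteration=it)
except NonFiniteLossError as err:
    print(err)
```

Once training stops at the offending iteration, the batch and the intermediate activations can be inspected directly. PyTorch's `torch.autograd.set_detect_anomaly(True)` may also help locate the backward operation that produces the NaN, at the cost of slower training.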

lxtGH commented

@jinfagang OK, I will re-train the model using PyTorch 1.4. I will also train with PyTorch 1.7 on our lab server, but that may take longer. Stay tuned.

lxtGH commented

@jinfagang One more question: which config did you use for training on Cityscapes? Could you share your detailed Cityscapes training settings for reference?

@lxtGH I didn't make any modifications. I am just using the sfresnet18 model without dsn, since dsn has a bug.

lxtGH commented

Hi @jinfagang! I did not hit any error. The results are normal using PyTorch 1.6 and PyTorch 1.4.
image