Unable to train
lucasjinreal opened this issue · 14 comments
I got NaN loss:
04-08 10:51:43.489 NaN or Inf found in input tensor.
04-08 10:51:43.559 [epoch 28], [iter 8136 / 13576], [train main loss nan], [lr 0.001678]
04-08 10:51:43.560 NaN or Inf found in input tensor.
04-08 10:51:43.632 [epoch 28], [iter 8137 / 13576], [train main loss nan], [lr 0.001678]
04-08 10:51:43.632 NaN or Inf found in input tensor.
04-08 10:51:43.965 [epoch 28], [iter 8138 / 13576], [train main loss nan], [lr 0.001678]
04-08 10:51:43.965 NaN or Inf found in input tensor.
04-08 10:51:44.036 [epoch 28], [iter 8139 / 13576], [train main loss nan], [lr 0.001678]
04-08 10:51:44.037 NaN or Inf found in input tensor.
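The repeated "NaN or Inf found in input tensor." line is the kind of warning emitted by a finite-check on the scalar being logged (tensorboardX prints this exact message when asked to log a NaN). A minimal pure-Python sketch of such a guard, with a hypothetical helper name, not the repo's actual logger:

```python
import math

def check_scalar(value: float) -> bool:
    """Return True if value is safe to log.

    Hypothetical helper mimicking the guard seen in the log above;
    the real warning likely comes from tensorboardX's scalar check.
    """
    if not math.isfinite(value):
        print("NaN or Inf found in input tensor.")
        return False
    return True
```

Once the loss itself is NaN, every subsequent iteration logs the same warning, which matches the run above.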
@lxtGH Epoch 1 converges, but from epoch 2 the loss grows larger and larger:
What about using pt.1.4?
@lxtGH Dude, to verify your code logic, I reinstalled the whole environment, including PyTorch 1.4. Same behavior.
You can test it on PyTorch 1.8 as well... None of the people from mmseg can reproduce your result...
@jinfagang What is your training setting with pt1.4? For mmseg, I will double-check the details.
Also, which dataset do you use? Cityscapes or your own custom dataset?
@lxtGH I am training on custom datasets, following exactly the same steps as for Cityscapes. But the bigger issue is that I also tried Cityscapes itself and hit the same problem.
In summary, using pt1.4 on Cityscapes you still have this problem?
@lxtGH Yes
@lxtGH Besides, I double-checked PyTorch 1.7. You can try PyTorch 1.7 too; this should not be a PyTorch version issue. Something else is going wrong. Try debugging with PyTorch 1.7 to see what causes it.
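While debugging, a NaN guard in the training loop can keep one bad batch from poisoning the weights and point to where the blow-up starts. A minimal sketch, not the repo's actual code (`guarded_step` and `max_grad_norm` are my own names); it skips the update when the loss is non-finite and clips exploding gradients:

```python
import torch

def guarded_step(loss: torch.Tensor, optimizer: torch.optim.Optimizer,
                 max_grad_norm: float = 10.0) -> bool:
    """Apply one optimizer step, but skip it entirely if the loss is NaN/Inf.

    Returns True if the step was taken, False if the batch was dropped.
    """
    if not torch.isfinite(loss):
        optimizer.zero_grad()  # drop this batch instead of stepping on NaN grads
        return False
    loss.backward()
    # Clip gradients to limit the effect of a single exploding batch.
    params = [p for g in optimizer.param_groups for p in g["params"]]
    torch.nn.utils.clip_grad_norm_(params, max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    return True
```

Wrapping the forward pass in `with torch.autograd.set_detect_anomaly(True):` also helps here, since it reports the first op that produces NaN in the backward pass (at a significant speed cost, so only for debugging).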
@jinfagang OK, I will re-train the model using pt1.4. I will also train the code with pt1.7 on our lab server, but that may take longer. I'll keep tuning it.
@jinfagang Another thing I want to ask: which config did you use for training Cityscapes? Could you share the detailed Cityscapes training settings for reference?
@lxtGH I didn't make any modifications. I just used the sfresnet18 model without dsn, since dsn has a bug.