NAN Loss
zkluo opened this issue · 1 comments
zkluo commented
It seems to be due to SGD's instability; reducing the lr didn't work for me. The following change finally solved the problem:
replace
train_main_loss.update(log_main_loss.item(), batch_pixel_size)
by
if torch.isnan(main_loss):
    logging.info("Train main loss is nan. Skipping train main loss update")
else:
    train_main_loss.update(log_main_loss.item(), batch_pixel_size)
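For anyone wanting to try the same guard outside this repo, here is a minimal self-contained sketch of the pattern: skip the running-average update whenever the loss is NaN. The `AverageMeter` class and `safe_update` helper are hypothetical stand-ins for the repo's meter and training-loop code (the actual fix uses `torch.isnan` on a tensor; this sketch uses `math.isnan` on plain floats so it runs without PyTorch).

```python
import math
import logging

class AverageMeter:
    """Minimal running-average tracker (stand-in for the repo's meter class)."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.sum += value * n
        self.count += n

    @property
    def avg(self):
        return self.sum / max(self.count, 1)

def safe_update(meter, loss_value, batch_pixel_size):
    """Update the meter only when the loss is finite, mirroring the fix above."""
    if math.isnan(loss_value):
        logging.info("Train main loss is nan. Skipping train main loss update")
        return False
    meter.update(loss_value, batch_pixel_size)
    return True

# A NaN loss is ignored; a finite loss is recorded.
meter = AverageMeter()
safe_update(meter, float("nan"), 1024)  # skipped, meter unchanged
safe_update(meter, 0.5, 1024)           # recorded
print(meter.avg)  # → 0.5
```

Note that skipping the meter update only keeps the logged average clean; if the NaN also reaches `loss.backward()`, the model weights can still be corrupted, so you may want to skip the whole optimizer step for that batch as well.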