lxtGH/DecoupleSegNets

NAN Loss

zkluo opened this issue · 1 comments

zkluo commented

Hi, when I run 'sh ./scripts/train/train_ciytscapes_ResNet50_deeplab_decouple.sh', the loss oscillates heavily and eventually becomes NaN. I use the pretrained model from 'sh ./scripts/train/train_cityscapes_ResNet50_deeplab.sh'.

zkluo commented

This seems to be caused by SGD instability; reducing the learning rate didn't help in my case. In the end, the following change seems to solve the problem:

replace

train_main_loss.update(log_main_loss.item(), batch_pixel_size)

by

if torch.isnan(main_loss):
    logging.info("Train main loss is nan. Skipping train main loss update")
else:
    train_main_loss.update(log_main_loss.item(), batch_pixel_size)
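
To illustrate why the guard helps, here is a minimal sketch in plain Python of how a single NaN value poisons a running average. `AverageMeter` and `guarded_update` are hypothetical stand-ins written for this example, not the repository's actual classes, and the loss values are made up. `math.isnan` on a float plays the role that `torch.isnan(main_loss)` plays in the training loop.

```python
import math


class AverageMeter:
    """Hypothetical stand-in for a weighted running-average meter."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        # Accumulate a value weighted by the number of samples (e.g. pixels).
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


def guarded_update(meter, loss_value, n):
    """Update the meter unless the loss is NaN; return True if it was updated."""
    if math.isnan(loss_value):
        return False
    meter.update(loss_value, n)
    return True


# Illustrative per-batch loss values (made up), each weighted by 4 pixels.
losses = [0.5, float("nan"), 1.5]

unguarded = AverageMeter()
guarded = AverageMeter()
skipped = 0
for value in losses:
    unguarded.update(value, 4)  # the NaN propagates into the running sum
    if not guarded_update(guarded, value, 4):
        skipped += 1  # the NaN batch is left out of the average

print("unguarded average:", unguarded.avg)
print("guarded average:", guarded.avg)
print("skipped batches:", skipped)
```

Once a NaN enters the unguarded running sum, every later average is NaN as well, while the guarded meter keeps reporting a usable number. As quoted above, the guard only wraps the meter update; whether the backward pass and optimizer step are also affected depends on code not shown in this thread.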