Loss becomes NAN after a few iterations

Question

ecm200 opened this issue 4 years ago · 16 comments

I am training on a bespoke dataset using the default Mask RCNN parameterisation (see parameters below). I have converted my dataset to COCO format, with the annotations in JSON files (polygons and bounding boxes). I have also written a bespoke image loading transform, as my images are in a 12-bit format. At this time I have only a single class of object, but this is likely to increase to 5 or 6 eventually. My training and test sets have 5000 and 1500 images respectively.

I have managed to get the network training, but after a few iterations all loss functions become NAN.

I have verified the standard COCO instance example runs on my machine using the 2017 dataset without these issues. I am using an adapted version of the example training script that is supplied for training with COCO data.

After about 500 batches of training in the first epoch, the loss functions suddenly change to NaN and don't recover. The validation step also returns NaN, and the accuracy can be seen to degrade and not recover.

Has anyone seen this before?

Configuration file for bespoke training of instance detection problem.

_base_ = [
    '../configs/_base_/models/mask_rcnn_r50_fpn.py',
    '../configs/_base_/datasets/coco_instance.py',
    '../configs/_base_/schedules/schedule_1x.py', '../configs/_base_/default_runtime.py'
]
dataset_type = 'CocoDataset'
classes=['particle']
data_root = 'advanced_seg/MMDetection_experiments/datasets/spherical_test_data_v1_5000_1500/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadMorphologiSynImage'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=(1296, 972), keep_ratio=True),
    dict(type='Normalize', **img_norm_cfg),
    #dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
test_pipeline = [
    dict(type='LoadMorphologiSynImage'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1296, 972),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='Normalize', **img_norm_cfg),
            #dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=5,
    train=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/train_coco.json',
        img_prefix=data_root + 'train/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/valid_coco.json',
        img_prefix=data_root + 'valid/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/valid_coco.json',
        img_prefix=data_root + 'valid/',
        pipeline=test_pipeline))
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook')
    ])
total_epochs = 30
gpus = 1

Bespoke image loading transform

import os.path as osp

import cv2
import numpy as np
from mmdet.datasets import PIPELINES


@PIPELINES.register_module()
class LoadMorphologiSynImage(object):

    def __init__(self, image_scale=255.0, image_format=np.float32):

        self.image_scale=image_scale
        self.image_format=image_format


    def __call__(self, results):

        if results['img_prefix'] is not None:
            filename = osp.join(results['img_prefix'],
                                results['img_info']['filename'])
        else:
            filename = results['img_info']['filename']

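        img = cv2.imread(filename, cv2.IMREAD_ANYDEPTH | cv2.IMREAD_COLOR)
        if img is None:
            raise FileNotFoundError('Failed to read image: ' + filename)

        # Rescale to [0, image_scale]. Skip the division for an all-zero
        # image, where img.max() == 0 would otherwise produce NaN pixels.
        img_max = img.max()
        if img_max > 0:
            img = (img / img_max) * self.image_scale

        img = self.image_format(img)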

        results['filename'] = filename
        results['img'] = img
        results['img_shape'] = img.shape
        results['ori_shape'] = img.shape
        results['flip'] = False

        # Set initial values for default meta_keys
        results['pad_shape'] = img.shape
        results['scale_factor'] = 1.0
        num_channels = 1 if len(img.shape) < 3 else img.shape[2]
        results['img_norm_cfg'] = dict(
            mean=np.zeros(num_channels, dtype=np.float32),
            std=np.ones(num_channels, dtype=np.float32),
            to_rgb=False)
        return results

Training script

import mmcv
from mmcv import Config, DictAction
from mmcv.runner import init_dist
import torch


from mmdet import __version__
from mmdet.apis import set_random_seed, train_detector
from mmdet.datasets import build_dataset
from mmdet.models import build_detector
from mmdet.utils import collect_env, get_root_logger

import __main__ as main
import os
import random
import datetime
import shutil
import copy
import time
from glob import glob
#from sklearn.model_selection import train_test_split
import albumentations as A
import numpy as np
import argparse

from mmdetection_morphologi_pipelines import LoadMorphologiSynImage

BASE_DIR = 'advanced_seg/MMDetection_experiments'
WORKFLOW = [('train',1), ('val', 1)]
def parse_args():
    parser = argparse.ArgumentParser(description='Train a detector')
    parser.add_argument('--config', help='train config file path', default=os.path.join(BASE_DIR,'configs_morph/mmdetection_morphologi_mask_rcnn_r50_fpn_1x.py'))
    parser.add_argument('--work_dir', help='the dir to save logs and models', default=os.path.join(BASE_DIR,'output'))
    parser.add_argument('--workflow', type=int, help='Workflow type [1] train only, [2] train and validate every epoch', default=2)
    parser.add_argument('--job_name', help='name for output files and dirs', default='spherical_test_data_v1_5000_1500_')
    parser.add_argument(
        '--resume-from', help='the checkpoint file to resume from')
    parser.add_argument(
        '--validate',
        action='store_true',
        help='whether to evaluate the checkpoint during training') #, default=True)
    group_gpus = parser.add_mutually_exclusive_group()
    group_gpus.add_argument(
        '--gpus',
        type=int,
        help='number of gpus to use '
        '(only applicable to non-distributed training)')
    group_gpus.add_argument(
        '--gpu-ids',
        type=int,
        nargs='+',
        help='ids of gpus to use '
        '(only applicable to non-distributed training)')
    parser.add_argument('--seed', type=int, default=42, help='random seed')
    parser.add_argument(
        '--deterministic',
        action='store_true',
        help='whether to set deterministic options for CUDNN backend.')
    parser.add_argument(
        '--options', nargs='+', action=DictAction, help='arguments in dict')
    parser.add_argument(
        '--launcher',
        choices=['none', 'pytorch', 'slurm', 'mpi'],
        default='none',
        help='job launcher')
    parser.add_argument('--local_rank', type=int, default=0)
    parser.add_argument(
        '--autoscale-lr',
        action='store_true',
        help='automatically scale lr with the number of gpus',
        default=True) #Added by ECM as this should always be used
    args = parser.parse_args()
    if 'LOCAL_RANK' not in os.environ:
        os.environ['LOCAL_RANK'] = str(args.local_rank)

    return args

if __name__ == '__main__':

    args = parse_args()

    # Output dir and job details
    job_name_preamble = args.job_name

    #### CONFIG
    ## Get the Base Config
    cfg = Config.fromfile(args.config)


    ## Set up Config
    print('------------------------------------------------------------------------------------------------------------------------')
    print('[CFG] Configuration changes from defaults.')
    print('------------------------------------------------------------------------------------------------------------------------')

    # Get additional keyword arguments for configuration
    if args.options is not None:
        cfg.merge_from_dict(args.options)

    # set cudnn_benchmark
    if cfg.get('cudnn_benchmark', False):
        torch.backends.cudnn.benchmark = True

    # Setup output dir
    if args.work_dir is not None:
        output_base_dir = args.work_dir
    elif cfg.get('work_dir', None) is None:
        output_base_dir = os.path.join(BASE_DIR,'output')
    cfg.work_dir = os.path.join(output_base_dir,job_name_preamble+cfg.model.type+'_'+cfg.model.backbone.type+str(cfg.model.backbone.depth)+'_'+cfg.model.neck.type+'_'+datetime.datetime.now().strftime('%d%m%Y_%H%M%S'))
    os.makedirs(cfg.work_dir, exist_ok=True)
    print('[CFG] Creating output directory: ', cfg.work_dir)
    shutil.copyfile(main.__file__, os.path.join(cfg.work_dir,main.__file__.split('/')[-1]))

    # Resume from previous iteration
    if args.resume_from is not None:
        cfg.resume_from = args.resume_from

    # Update the default number of GPUs if different from config.
    if args.gpu_ids is not None:
        cfg.gpu_ids = args.gpu_ids
    else:
        cfg.gpu_ids = range(1) if args.gpus is None else range(args.gpus)

    # Update the autoscaler with the number of GPUs if changed from default 8 gpus.
    # ECM Modified so it takes into account the actual mini-batch size, which is dependent on the number of gpus and the images per gpu.
    # It now scales with the default mini-batch of 8 gpus and 2 images per gpu (16), and changes in response to changes in both gpus and/or images per gpu.
    if args.autoscale_lr:
        _old_lr = cfg.optimizer['lr']
        cfg.optimizer['lr'] =cfg.optimizer['lr'] * (len(cfg.gpu_ids) * cfg.data.imgs_per_gpu) / (8*2)
        print('[CFG] Applying linear Learning Rate correction. LR changed from ', _old_lr,'to ', cfg.optimizer['lr'])

    # init distributed env first, since logger depends on the dist info.
    if args.launcher == 'none':
        print('[CFG] Distributed environment has not been initialized.')
        distributed = False
    else:
        print('[CFG] Distributed environment initialising...')
        distributed = True
        init_dist(args.launcher, **cfg.dist_params)

    # Set workflow override
    if args.workflow == 1:
        cfg.workflow = [('train', 1)]
    elif args.workflow == 2:
        cfg.workflow = [('train', 1), ('val', 1)]

    print('------------------------------------------------------------------------------------------------------------------------')

    print('[INFO] Initialising logging...')
    print('------------------------------------------------------------------------------------------------------------------------')
    # init the logger before other steps
    timestamp = time.strftime('%Y%m%d_%H%M%S', time.localtime())
    log_file = os.path.join(cfg.work_dir, '{}.log'.format(timestamp))
    logger = get_root_logger(log_file=log_file, log_level=cfg.log_level)

    # init the meta dict to record some important information such as
    # environment info and seed, which will be logged
    meta = dict()
    # log env info
    env_info_dict = collect_env()
    env_info = '\n'.join([('{}: {}'.format(k, v))
                            for k, v in env_info_dict.items()])
    dash_line = '-' * 60 + '\n'
    logger.info('Environment info:\n' + dash_line + env_info + '\n' +
                dash_line)
    meta['env_info'] = env_info
    print('------------------------------------------------------------------------------------------------------------------------')
    # log some basic info
    logger.info('Distributed training: {}'.format(distributed))
    logger.info('Config:\n{}'.format(cfg.text))

    # set random seeds
    if args.seed is not None:
        logger.info(f'Set random seed to {args.seed}, '
                    f'deterministic: {args.deterministic}')
        set_random_seed(args.seed, deterministic=args.deterministic)
    cfg.seed = args.seed
    meta['seed'] = args.seed

    model = build_detector(cfg.model, train_cfg=cfg.train_cfg, test_cfg=cfg.test_cfg)

    datasets = [build_dataset(cfg.data.train)] #, build_dataset(cfg.data.val)]

    if len(cfg.workflow) == 2:
        val_dataset = copy.deepcopy(cfg.data.val)
        val_dataset.pipeline = cfg.data.train.pipeline
        datasets.append(build_dataset(val_dataset))
    if cfg.checkpoint_config is not None:
        # save mmdet version, config file content and class names in
        # checkpoints as meta data
        cfg.checkpoint_config.meta = dict(
            mmdet_version=__version__,
            config=cfg.text,
            CLASSES=datasets[0].CLASSES)

    # add an attribute for visualization convenience
    model.CLASSES = datasets[0].CLASSES
    train_detector(
        model,
        datasets,
        cfg,
        distributed=distributed,
        validate=args.validate,
        timestamp=timestamp,
        meta=meta)

Model training output log

loading annotations into memory...
Done (t=59.47s)
creating index...
index created!
loading annotations into memory...
Done (t=15.04s)
creating index...
index created!
2020-05-15 14:33:58,617 - mmdet - INFO - Start running, host: edmorris@willow-tree-cnn-gpu-lin64, work_dir: /home/edmorris/notebooks/Projects/WillowTree/Repo/advanced_seg/MMDetection_experiments/output/spherical_test_data_v1_5000_1500_MaskRCNN_ResNet50_FPN_15052020_143239
2020-05-15 14:33:58,617 - mmdet - INFO - workflow: [('train', 1), ('val', 1)], max: 30 epochs
2020-05-15 14:35:35,679 - mmdet - INFO - Epoch [1][50/2500]     lr: 0.00025, eta: 1 day, 16:19:03, time: 1.937, data_time: 1.090, memory: 6226, loss_rpn_cls: 0.6667, loss_rpn_bbox: 0.3420, loss_cls: 2.2791, acc: 55.0977, loss_bbox: 0.0901, loss_mask: 0.7499, loss: 4.1278
2020-05-15 14:37:07,541 - mmdet - INFO - Epoch [1][100/2500]    lr: 0.00050, eta: 1 day, 15:15:27, time: 1.837, data_time: 1.082, memory: 6226, loss_rpn_cls: 0.5279, loss_rpn_bbox: 0.2552, loss_cls: 0.4233, acc: 81.6328, loss_bbox: 0.1590, loss_mask: 0.5800, loss: 1.9454
2020-05-15 14:38:42,158 - mmdet - INFO - Epoch [1][150/2500]    lr: 0.00075, eta: 1 day, 15:16:08, time: 1.892, data_time: 1.141, memory: 6226, loss_rpn_cls: 0.3954, loss_rpn_bbox: 0.2492, loss_cls: 0.3779, acc: 84.6680, loss_bbox: nan, loss_mask: 0.5657, loss: nan
2020-05-15 14:40:16,140 - mmdet - INFO - Epoch [1][200/2500]    lr: 0.00100, eta: 1 day, 15:11:43, time: 1.880, data_time: 1.120, memory: 6226, loss_rpn_cls: 0.3295, loss_rpn_bbox: 0.2594, loss_cls: 0.3680, acc: 84.9922, loss_bbox: 0.3431, loss_mask: 0.5553, loss: 1.8553
2020-05-15 14:41:49,633 - mmdet - INFO - Epoch [1][250/2500]    lr: 0.00125, eta: 1 day, 15:06:01, time: 1.870, data_time: 1.108, memory: 6226, loss_rpn_cls: 0.2658, loss_rpn_bbox: 0.2572, loss_cls: 0.3341, acc: 86.3379, loss_bbox: 0.3521, loss_mask: 0.5561, loss: 1.7653
2020-05-15 14:43:24,071 - mmdet - INFO - Epoch [1][300/2500]    lr: 0.00150, eta: 1 day, 15:05:35, time: 1.889, data_time: 1.121, memory: 6226, loss_rpn_cls: 0.2355, loss_rpn_bbox: 0.2700, loss_cls: 0.3596, acc: 84.7324, loss_bbox: 0.4131, loss_mask: 0.5532, loss: 1.8316
2020-05-15 14:44:57,690 - mmdet - INFO - Epoch [1][350/2500]    lr: 0.00175, eta: 1 day, 15:01:59, time: 1.873, data_time: 1.107, memory: 6226, loss_rpn_cls: 0.2049, loss_rpn_bbox: 0.2524, loss_cls: 0.4024, acc: 83.2246, loss_bbox: 0.4973, loss_mask: 0.5859, loss: 1.9431
2020-05-15 14:46:32,112 - mmdet - INFO - Epoch [1][400/2500]    lr: 0.00200, eta: 1 day, 15:01:21, time: 1.888, data_time: 1.124, memory: 6226, loss_rpn_cls: 0.2330, loss_rpn_bbox: 0.2828, loss_cls: 0.4456, acc: 81.2344, loss_bbox: 0.4396, loss_mask: 0.5704, loss: 1.9714
2020-05-15 14:48:06,092 - mmdet - INFO - Epoch [1][450/2500]    lr: 0.00225, eta: 1 day, 14:59:18, time: 1.880, data_time: 1.114, memory: 6226, loss_rpn_cls: 0.2122, loss_rpn_bbox: 0.2472, loss_cls: 0.4230, acc: 82.1914, loss_bbox: 0.4401, loss_mask: 0.5677, loss: 1.8901
2020-05-15 14:49:38,977 - mmdet - INFO - Epoch [1][500/2500]    lr: 0.00250, eta: 1 day, 14:54:37, time: 1.858, data_time: 1.106, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 47.8086, loss_bbox: nan, loss_mask: nan, loss: nan
2020-05-15 14:51:12,160 - mmdet - INFO - Epoch [1][550/2500]    lr: 0.00250, eta: 1 day, 14:51:10, time: 1.864, data_time: 1.121, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 12.1504, loss_bbox: nan, loss_mask: nan, loss: nan
2020-05-15 14:52:45,653 - mmdet - INFO - Epoch [1][600/2500]    lr: 0.00250, eta: 1 day, 14:48:41, time: 1.870, data_time: 1.121, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 4.4023, loss_bbox: nan, loss_mask: nan, loss: nan
2020-05-15 14:54:17,839 - mmdet - INFO - Epoch [1][650/2500]    lr: 0.00250, eta: 1 day, 14:43:51, time: 1.844, data_time: 1.105, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 12.9000, loss_bbox: nan, loss_mask: nan, loss: nan
2020-05-15 14:55:51,865 - mmdet - INFO - Epoch [1][700/2500]    lr: 0.00250, eta: 1 day, 14:42:45, time: 1.881, data_time: 1.135, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 13.4023, loss_bbox: nan, loss_mask: nan, loss: nan
2020-05-15 14:57:24,852 - mmdet - INFO - Epoch [1][750/2500]    lr: 0.00250, eta: 1 day, 14:39:52, time: 1.860, data_time: 1.123, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 10.3004, loss_bbox: nan, loss_mask: nan, loss: nan
2020-05-15 14:58:57,577 - mmdet - INFO - Epoch [1][800/2500]    lr: 0.00250, eta: 1 day, 14:36:45, time: 1.854, data_time: 1.113, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 16.2500, loss_bbox: nan, loss_mask: nan, loss: nan

Answer 1 · 2020-05-16T12:07:11.000Z

I faced a similar issue when training MS-RCNN on COCO with a single GPU.
However, the model trains without producing NaN when using multiple GPUs on a single machine.

Answer 2 · 2020-05-16T12:12:26.000Z

Thank you @deepakksingh for your comment, I hadn’t considered that avenue of investigation.
I will try running on dual GPUs as it’s on an Azure VM, so simple to scale.
Thanks.

Answer 3 · 2020-05-18T10:22:37.000Z

@deepakksingh I tried running it in distributed mode, on 2 GPUs on a single node, and I unfortunately encountered the same problem.

Answer 4 · 2020-05-18T11:41:39.000Z

Hello @ecm200, have you tried modifying the hyperparameters like learning rate and such?
They recently added https://github.com/open-mmlab/mmdetection/blob/master/docs/tutorials/new_dataset.md, which may be helpful to you.

Answer 5 · 2020-05-18T11:46:12.000Z

@deepakksingh thanks for the suggestions.
I haven't played around too much with the learning rate and other hyperparameters yet. I've been careful to make sure that my learning rate follows the "linear scaling rule" with mini-batch size, as I have mostly been working on a single GPU with 2 images per mini-batch. Thus I've scaled the learning rate accordingly (1/8 of the default).
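
For reference, the scaling described here can be sketched as below. This is a minimal illustration, not the training script itself: the function name `scale_lr` is hypothetical, and the reference batch size of 16 (8 GPUs with 2 images each) is taken from the comments in the training script above.

```python
def scale_lr(base_lr, num_gpus, imgs_per_gpu, base_batch_size=16):
    """Scale a learning rate linearly with the effective mini-batch size.

    base_batch_size=16 assumes the default schedule was tuned for
    8 GPUs x 2 images per GPU.
    """
    effective_batch_size = num_gpus * imgs_per_gpu
    return base_lr * effective_batch_size / base_batch_size


# One GPU with 2 images per GPU: 2/16 = 1/8 of the default 0.02.
single_gpu_lr = scale_lr(0.02, num_gpus=1, imgs_per_gpu=2)
print(single_gpu_lr)
```

The result, 0.0025, matches the learning rate reached at the end of warmup in the training log.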

I have seen that tutorial, and noted from it the classes argument, which I had previously missed.

I am using pre-trained models, perhaps I should start from randomized parameters instead?

Answer 6 · 2020-05-18T11:49:53.000Z

I'm new to the mmdetection framework and am still figuring things out myself.
There's no harm in giving the randomized parameters approach a try.

Answer 7 · 2020-05-18T19:17:37.000Z

I think my issues stemmed from a few bugs in my conversion to the COCO dataset format. I am using synthetic data to obtain enough data to train the network so that it generalizes well to our small set of real data. To make the data as realistic as possible, and to increase the number of training images at a lower computational cost, I have implemented a bespoke augmentation workflow. This includes random translations of the simulation images, and the code did not ensure that all bounding boxes stayed within the image frame, so part of a bounding box could end up outside the frame. This appears to have caused the issues.

The images are rectangular, so some of the objects are rotated completely or partially out of the image frame. I was dealing adequately with the polygons, and the bounding boxes were handled correctly when the object was completely out of frame. However, for objects that were partially rotated out of the frame, the polygon vertices were being deleted but the bounding boxes were not being modified. I now squeeze the bounding boxes into the image frame, and the losses are returning numerical values.
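
A minimal sketch of what "squeezing" a box into the frame could look like, assuming COCO-style `[x, y, w, h]` boxes in pixel coordinates. The function name `clip_coco_bbox` and the `min_size` threshold are hypothetical and are not taken from the actual augmentation code.

```python
def clip_coco_bbox(bbox, img_width, img_height, min_size=1.0):
    """Clip a COCO [x, y, w, h] box to the image frame.

    Returns the clipped box, or None when less than min_size pixels
    of width or height remain inside the frame.
    """
    x, y, w, h = bbox
    # Convert to corners and clamp each corner to the image bounds.
    x1 = min(max(x, 0.0), img_width)
    y1 = min(max(y, 0.0), img_height)
    x2 = min(max(x + w, 0.0), img_width)
    y2 = min(max(y + h, 0.0), img_height)
    new_w, new_h = x2 - x1, y2 - y1
    if new_w < min_size or new_h < min_size:
        return None  # box lies (almost) entirely outside the frame
    return [x1, y1, new_w, new_h]


# A box hanging over the left edge is trimmed; one fully outside is dropped.
partly_outside = clip_coco_bbox([-10, 20, 50, 30], img_width=100, img_height=100)
fully_outside = clip_coco_bbox([150, 10, 20, 20], img_width=100, img_height=100)
```

Annotations for which the function returns None would need to be removed together with their polygons, so that boxes and masks stay consistent.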

Answer 8 · 2020-05-20T02:28:18.000Z

Some general suggestions to deal with such NaN losses:

1. Check if the dataset annotations are correct.
2. Reduce the learning rate.
3. Extend the warmup iterations.
4. Add gradient clipping.
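
As a rough illustration of the last three suggestions, a config override might look like the fragment below. It assumes MMDetection 2.x-style keys (`optimizer`, `optimizer_config`, `lr_config`); the specific values are illustrative assumptions rather than recommendations from this thread.

```python
# Illustrative overrides, assuming MMDetection 2.x-style config keys.

# Reduce the learning rate: half of the 0.02 default. The training script
# in this thread would still rescale it by the mini-batch size.
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)

# Add gradient clipping: cap the L2 norm of the gradients (value assumed).
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))

# Extend the warmup: ramp the learning rate up over more iterations.
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=2000,
    warmup_ratio=0.001,
    step=[8, 11])
```

Placed in the bespoke config file, these would take precedence over the values inherited from the base schedule file.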

Answer 9 · 2020-06-06T07:57:07.000Z

In the annotations, what is the format of bounding boxes? Is it x,y,X,Y or x,y,w,h ? @hellock

Answer 10 · 2020-06-06T08:24:18.000Z

@sizhky
It's x, y, w, h .

Answer 11 · 2020-06-06T10:13:02.000Z

Then why is it mentioned here as x,y,X,Y?

Answer 12 · 2020-06-06T10:20:08.000Z

@sizhky
In COCO annotation, it is x, y, w, h.
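
For illustration, the two conventions discussed here can be converted as follows. The helper names are hypothetical. My understanding is that the dataset loader converts the COCO `x, y, w, h` values to corner form internally, which may be where an x,y,X,Y description comes from, but that is an assumption worth checking against the version in use.

```python
def xywh_to_xyxy(bbox):
    """Convert [x, y, w, h] (COCO annotation format) to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def xyxy_to_xywh(bbox):
    """Convert [x1, y1, x2, y2] corners back to [x, y, w, h]."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2 - x1, y2 - y1]


corners = xywh_to_xyxy([10, 20, 30, 40])
round_trip = xyxy_to_xywh(corners)
```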

Answer 13 · 2022-02-03T15:33:28.000Z

Try reducing the learning rate by a factor of 100 or more.

Answer 14 · 2022-06-26T01:58:04.000Z

Some general suggestions to deal with such NaN losses:

1. Check if the dataset annotations are correct.
2. Reduce the learning rate.
3. Extend the warmup iterations.
4. Add gradient clipping.

The third method is very useful to me.

Answer 15 · 2023-06-15T14:00:36.000Z

Can you please tell how to implement the third method?

Answer 16 · 2023-06-26T09:25:43.000Z

Can you please tell how to implement the third method?

Maybe it is in schedule_1x.py; try changing warmup_ratio.
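
To see what the warmup settings do, here is a small sketch of a linear warmup schedule. The function name is hypothetical, and the defaults `warmup_iters=500` and `warmup_ratio=0.001` are assumptions chosen because they reproduce the learning rates in the training log above (about 0.00025 at iteration 50, 0.0025 from iteration 500). The number of warmup iterations would be `warmup_iters`, while `warmup_ratio` sets the starting fraction of the learning rate.

```python
def linear_warmup_lr(cur_iter, base_lr, warmup_iters=500, warmup_ratio=0.001):
    """Learning rate at cur_iter under a linear warmup.

    The rate starts at base_lr * warmup_ratio and rises linearly to
    base_lr at warmup_iters, after which it stays at base_lr here
    (later step decay is ignored in this sketch).
    """
    if cur_iter >= warmup_iters:
        return base_lr
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return base_lr * (1 - k)


# With the assumed defaults, the rate follows the values seen in the log.
lr_at_50 = linear_warmup_lr(50, base_lr=0.0025)
lr_at_500 = linear_warmup_lr(500, base_lr=0.0025)

# Extending the warmup keeps the rate lower for longer.
lr_at_500_long = linear_warmup_lr(500, base_lr=0.0025, warmup_iters=2000)
```

In the log, the losses turn to NaN right as the learning rate reaches its full value, so a longer warmup delays that point.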