Loss becomes NAN after a few iterations
ecm200 opened this issue · 16 comments
I am training on a bespoke dataset using the default Mask RCNN parameterisation (see parameters below). I have converted my bespoke dataset to COCO format, with the annotations in JSON files (polygons and bounding boxes). I have also written a bespoke image loading transform, as my images are in a 12-bit format. At this time I have only a single class of object, but this is likely to increase to 5 or 6 eventually. My training and test sets have 5000 and 1500 images respectively.
I have managed to get the network training, but after a few iterations all loss functions become NAN.
I have verified the standard COCO instance example runs on my machine using the 2017 dataset without these issues. I am using an adapted version of the example training script that is supplied for training with COCO data.
After about 500 batches during training in the first epoch, the loss functions suddenly change to NaN and don't recover. The validation step also returns NaN, and the accuracy can be seen to degrade and not recover.
Has anyone seen this before?
Configuration file for bespoke training of instance detection problem.
_base_ = [
    '../configs/_base_/models/mask_rcnn_r50_fpn.py',
    '../configs/_base_/datasets/coco_instance.py',
    '../configs/_base_/schedules/schedule_1x.py', '../configs/_base_/default_runtime.py'
]
dataset_type = 'CocoDataset'
classes = ['particle']
data_root = 'advanced_seg/MMDetection_experiments/datasets/spherical_test_data_v1_5000_1500/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadMorphologiSynImage'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=(1296, 972), keep_ratio=True),
    dict(type='Normalize', **img_norm_cfg),
    #dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
test_pipeline = [
    dict(type='LoadMorphologiSynImage'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1296, 972),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='Normalize', **img_norm_cfg),
            #dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=5,
    train=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/train_coco.json',
        img_prefix=data_root + 'train/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/valid_coco.json',
        img_prefix=data_root + 'valid/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        classes=classes,
        ann_file=data_root + 'annotations/valid_coco.json',
        img_prefix=data_root + 'valid/',
        pipeline=test_pipeline))
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook')
    ])
total_epochs = 30
gpus = 1
Bespoke image loading transform
# Imports assumed; the original post did not show them.
import os.path as osp

import cv2
import numpy as np
from mmdet.datasets import PIPELINES


@PIPELINES.register_module()
class LoadMorphologiSynImage(object):
    """Load a high bit-depth image and rescale it so its maximum maps to image_scale."""

    def __init__(self, image_scale=255.0, image_format=np.float32):
        self.image_scale = image_scale
        self.image_format = image_format

    def __call__(self, results):
        if results['img_prefix'] is not None:
            filename = osp.join(results['img_prefix'],
                                results['img_info']['filename'])
        else:
            filename = results['img_info']['filename']
        # IMREAD_ANYDEPTH keeps the original bit depth instead of converting to 8-bit
        img = cv2.imread(filename, cv2.IMREAD_ANYDEPTH | cv2.IMREAD_COLOR)
        # Rescale per image so the brightest pixel equals image_scale
        img = (img / img.max()) * self.image_scale
        img = self.image_format(img)
        results['filename'] = filename
        results['img'] = img
        results['img_shape'] = img.shape
        results['ori_shape'] = img.shape
        results['flip'] = False
        # Set initial values for default meta_keys
        results['pad_shape'] = img.shape
        results['scale_factor'] = 1.0
        num_channels = 1 if len(img.shape) < 3 else img.shape[2]
        results['img_norm_cfg'] = dict(
            mean=np.zeros(num_channels, dtype=np.float32),
            std=np.ones(num_channels, dtype=np.float32),
            to_rgb=False)
        return results
Training script
import mmcv
from mmcv import Config, DictAction
from mmcv.runner import init_dist
import torch
from mmdet import __version__
from mmdet.apis import set_random_seed, train_detector
from mmdet.datasets import build_dataset
from mmdet.models import build_detector
from mmdet.utils import collect_env, get_root_logger
import __main__ as main
import os
import random
import datetime
import shutil
import copy
import time
from glob import glob
#from sklearn.model_selection import train_test_split
import albumentations as A
import numpy as np
import argparse
from mmdetection_morphologi_pipelines import LoadMorphologiSynImage
BASE_DIR = 'advanced_seg/MMDetection_experiments'
WORKFLOW = [('train',1), ('val', 1)]
def parse_args():
    parser = argparse.ArgumentParser(description='Train a detector')
    parser.add_argument('--config', help='train config file path', default=os.path.join(BASE_DIR,'configs_morph/mmdetection_morphologi_mask_rcnn_r50_fpn_1x.py'))
    parser.add_argument('--work_dir', help='the dir to save logs and models', default=os.path.join(BASE_DIR,'output'))
    # Values match the checks in the main block: 1 = train only, 2 = train and validate
    parser.add_argument('--workflow', type=int, help='Workflow type [1] train only, [2] train and validate every epoch', default=2)
    parser.add_argument('--job_name', help='name for output files and dirs', default='spherical_test_data_v1_5000_1500_')
    parser.add_argument(
        '--resume-from', help='the checkpoint file to resume from')
    parser.add_argument(
        '--validate',
        action='store_true',
        help='whether to evaluate the checkpoint during training') #, default=True)
    group_gpus = parser.add_mutually_exclusive_group()
    group_gpus.add_argument(
        '--gpus',
        type=int,
        help='number of gpus to use '
        '(only applicable to non-distributed training)')
    group_gpus.add_argument(
        '--gpu-ids',
        type=int,
        nargs='+',
        help='ids of gpus to use '
        '(only applicable to non-distributed training)')
    parser.add_argument('--seed', type=int, default=42, help='random seed')
    parser.add_argument(
        '--deterministic',
        action='store_true',
        help='whether to set deterministic options for CUDNN backend.')
    parser.add_argument(
        '--options', nargs='+', action=DictAction, help='arguments in dict')
    parser.add_argument(
        '--launcher',
        choices=['none', 'pytorch', 'slurm', 'mpi'],
        default='none',
        help='job launcher')
    parser.add_argument('--local_rank', type=int, default=0)
    parser.add_argument(
        '--autoscale-lr',
        action='store_true',
        help='automatically scale lr with the number of gpus',
        default=True) #Added by ECM as this should always be used
    args = parser.parse_args()
    if 'LOCAL_RANK' not in os.environ:
        os.environ['LOCAL_RANK'] = str(args.local_rank)
    return args
if __name__ == '__main__':
    args = parse_args()
    # Output dir and job details
    job_name_preamble = args.job_name
    #### CONFIG
    ## Get the Base Config
    cfg = Config.fromfile(args.config)
    ## Set up Config
    print('------------------------------------------------------------------------------------------------------------------------')
    print('[CFG] Configuration changes from defaults.')
    print('------------------------------------------------------------------------------------------------------------------------')
    # Get additional keyword arguments for configuration
    if args.options is not None:
        cfg.merge_from_dict(args.options)
    # set cudnn_benchmark
    if cfg.get('cudnn_benchmark', False):
        torch.backends.cudnn.benchmark = True
    # Setup output dir
    if args.work_dir is not None:
        output_base_dir = args.work_dir
    elif cfg.get('work_dir', None) is None:
        output_base_dir = os.path.join(BASE_DIR,'output')
    cfg.work_dir = os.path.join(output_base_dir,job_name_preamble+cfg.model.type+'_'+cfg.model.backbone.type+str(cfg.model.backbone.depth)+'_'+cfg.model.neck.type+'_'+datetime.datetime.now().strftime('%d%m%Y_%H%M%S'))
    os.makedirs(cfg.work_dir, exist_ok=True)
    print('[CFG] Creating output directory: ', cfg.work_dir)
    shutil.copyfile(main.__file__, os.path.join(cfg.work_dir,main.__file__.split('/')[-1]))
    # Resume from previous iteration
    if args.resume_from is not None:
        cfg.resume_from = args.resume_from
    # Update the default number of GPUs if different from config.
    if args.gpu_ids is not None:
        cfg.gpu_ids = args.gpu_ids
    else:
        cfg.gpu_ids = range(1) if args.gpus is None else range(args.gpus)
    # Update the autoscaler with the number of GPUs if changed from default 8 gpus.
    # ECM Modified so it takes into account the actual mini-batch size, which is dependent on the number of gpus and the images per gpu.
    # It now scales with the default mini-batch of 8 gpus and 2 images per gpu (16), and changes in response to changes in both gpus and/or images per gpu.
    if args.autoscale_lr:
        _old_lr = cfg.optimizer['lr']
        cfg.optimizer['lr'] = cfg.optimizer['lr'] * (len(cfg.gpu_ids) * cfg.data.imgs_per_gpu) / (8*2)
        print('[CFG] Applying linear Learning Rate correction. LR changed from ', _old_lr,'to ', cfg.optimizer['lr'])
    # init distributed env first, since logger depends on the dist info.
    if args.launcher == 'none':
        print('[CFG] Distributed environment has not been initialized.')
        distributed = False
    else:
        print('[CFG] Distributed environment initialising...')
        distributed = True
        init_dist(args.launcher, **cfg.dist_params)
    # Set workflow override
    if args.workflow == 1:
        cfg.workflow = [('train', 1)]
    elif args.workflow == 2:
        cfg.workflow = [('train', 1), ('val', 1)]
    print('------------------------------------------------------------------------------------------------------------------------')
    print('[INFO] Initialising logging...')
    print('------------------------------------------------------------------------------------------------------------------------')
    # init the logger before other steps
    timestamp = time.strftime('%Y%m%d_%H%M%S', time.localtime())
    log_file = os.path.join(cfg.work_dir, '{}.log'.format(timestamp))
    logger = get_root_logger(log_file=log_file, log_level=cfg.log_level)
    # init the meta dict to record some important information such as
    # environment info and seed, which will be logged
    meta = dict()
    # log env info
    env_info_dict = collect_env()
    env_info = '\n'.join([('{}: {}'.format(k, v))
                          for k, v in env_info_dict.items()])
    dash_line = '-' * 60 + '\n'
    logger.info('Environment info:\n' + dash_line + env_info + '\n' +
                dash_line)
    meta['env_info'] = env_info
    print('------------------------------------------------------------------------------------------------------------------------')
    # log some basic info
    logger.info('Distributed training: {}'.format(distributed))
    logger.info('Config:\n{}'.format(cfg.text))
    # set random seeds
    if args.seed is not None:
        logger.info(f'Set random seed to {args.seed}, '
                    f'deterministic: {args.deterministic}')
        set_random_seed(args.seed, deterministic=args.deterministic)
    cfg.seed = args.seed
    meta['seed'] = args.seed
    model = build_detector(cfg.model, train_cfg=cfg.train_cfg, test_cfg=cfg.test_cfg)
    datasets = [build_dataset(cfg.data.train)] #, build_dataset(cfg.data.val)]
    if len(cfg.workflow) == 2:
        val_dataset = copy.deepcopy(cfg.data.val)
        val_dataset.pipeline = cfg.data.train.pipeline
        datasets.append(build_dataset(val_dataset))
    if cfg.checkpoint_config is not None:
        # save mmdet version, config file content and class names in
        # checkpoints as meta data
        cfg.checkpoint_config.meta = dict(
            mmdet_version=__version__,
            config=cfg.text,
            CLASSES=datasets[0].CLASSES)
    # add an attribute for visualization convenience
    model.CLASSES = datasets[0].CLASSES
    train_detector(
        model,
        datasets,
        cfg,
        distributed=distributed,
        validate=args.validate,
        timestamp=timestamp,
        meta=meta)
Model training output log
loading annotations into memory...
Done (t=59.47s)
creating index...
index created!
loading annotations into memory...
Done (t=15.04s)
creating index...
index created!
2020-05-15 14:33:58,617 - mmdet - INFO - Start running, host: edmorris@willow-tree-cnn-gpu-lin64, work_dir: /home/edmorris/notebooks/Projects/WillowTree/Repo/advanced_seg/MMDetection_experiments/output/spherical_test_data_v1_5000_1500_Mas$
RCNN_ResNet50_FPN_15052020_143239
2020-05-15 14:33:58,617 - mmdet - INFO - workflow: [('train', 1), ('val', 1)], max: 30 epochs
2020-05-15 14:35:35,679 - mmdet - INFO - Epoch [1][50/2500] lr: 0.00025, eta: 1 day, 16:19:03, time: 1.937, data_time: 1.090, memory: 6226, loss_rpn_cls: 0.6667, loss_rpn_bbox: 0.3420, loss_cls: 2.2791, acc: 55.0977, loss_bbox: 0.0901$
loss_mask: 0.7499, loss: 4.1278
2020-05-15 14:37:07,541 - mmdet - INFO - Epoch [1][100/2500] lr: 0.00050, eta: 1 day, 15:15:27, time: 1.837, data_time: 1.082, memory: 6226, loss_rpn_cls: 0.5279, loss_rpn_bbox: 0.2552, loss_cls: 0.4233, acc: 81.6328, loss_bbox: 0.1590$
loss_mask: 0.5800, loss: 1.9454
2020-05-15 14:38:42,158 - mmdet - INFO - Epoch [1][150/2500] lr: 0.00075, eta: 1 day, 15:16:08, time: 1.892, data_time: 1.141, memory: 6226, loss_rpn_cls: 0.3954, loss_rpn_bbox: 0.2492, loss_cls: 0.3779, acc: 84.6680, loss_bbox: nan, l$
ss_mask: 0.5657, loss: nan
2020-05-15 14:40:16,140 - mmdet - INFO - Epoch [1][200/2500] lr: 0.00100, eta: 1 day, 15:11:43, time: 1.880, data_time: 1.120, memory: 6226, loss_rpn_cls: 0.3295, loss_rpn_bbox: 0.2594, loss_cls: 0.3680, acc: 84.9922, loss_bbox: 0.3431$
loss_mask: 0.5553, loss: 1.8553
2020-05-15 14:41:49,633 - mmdet - INFO - Epoch [1][250/2500] lr: 0.00125, eta: 1 day, 15:06:01, time: 1.870, data_time: 1.108, memory: 6226, loss_rpn_cls: 0.2658, loss_rpn_bbox: 0.2572, loss_cls: 0.3341, acc: 86.3379, loss_bbox: 0.3521$
loss_mask: 0.5561, loss: 1.7653
2020-05-15 14:43:24,071 - mmdet - INFO - Epoch [1][300/2500] lr: 0.00150, eta: 1 day, 15:05:35, time: 1.889, data_time: 1.121, memory: 6226, loss_rpn_cls: 0.2355, loss_rpn_bbox: 0.2700, loss_cls: 0.3596, acc: 84.7324, loss_bbox: 0.4131$
loss_mask: 0.5532, loss: 1.8316
2020-05-15 14:44:57,690 - mmdet - INFO - Epoch [1][350/2500] lr: 0.00175, eta: 1 day, 15:01:59, time: 1.873, data_time: 1.107, memory: 6226, loss_rpn_cls: 0.2049, loss_rpn_bbox: 0.2524, loss_cls: 0.4024, acc: 83.2246, loss_bbox: 0.4973$
loss_mask: 0.5859, loss: 1.9431
2020-05-15 14:46:32,112 - mmdet - INFO - Epoch [1][400/2500] lr: 0.00200, eta: 1 day, 15:01:21, time: 1.888, data_time: 1.124, memory: 6226, loss_rpn_cls: 0.2330, loss_rpn_bbox: 0.2828, loss_cls: 0.4456, acc: 81.2344, loss_bbox: 0.4396$
loss_mask: 0.5704, loss: 1.9714
2020-05-15 14:48:06,092 - mmdet - INFO - Epoch [1][450/2500] lr: 0.00225, eta: 1 day, 14:59:18, time: 1.880, data_time: 1.114, memory: 6226, loss_rpn_cls: 0.2122, loss_rpn_bbox: 0.2472, loss_cls: 0.4230, acc: 82.1914, loss_bbox: 0.4401$
loss_mask: 0.5677, loss: 1.8901
2020-05-15 14:49:38,977 - mmdet - INFO - Epoch [1][500/2500] lr: 0.00250, eta: 1 day, 14:54:37, time: 1.858, data_time: 1.106, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 47.8086, loss_bbox: nan, loss_mask:
nan, loss: nan
2020-05-15 14:51:12,160 - mmdet - INFO - Epoch [1][550/2500] lr: 0.00250, eta: 1 day, 14:51:10, time: 1.864, data_time: 1.121, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 12.1504, loss_bbox: nan, loss_mask:
nan, loss: nan
2020-05-15 14:52:45,653 - mmdet - INFO - Epoch [1][600/2500] lr: 0.00250, eta: 1 day, 14:48:41, time: 1.870, data_time: 1.121, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 4.4023, loss_bbox: nan, loss_mask: $
an, loss: nan
2020-05-15 14:54:17,839 - mmdet - INFO - Epoch [1][650/2500] lr: 0.00250, eta: 1 day, 14:43:51, time: 1.844, data_time: 1.105, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 12.9000, loss_bbox: nan, loss_mask:
nan, loss: nan
2020-05-15 14:55:51,865 - mmdet - INFO - Epoch [1][700/2500] lr: 0.00250, eta: 1 day, 14:42:45, time: 1.881, data_time: 1.135, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 13.4023, loss_bbox: nan, loss_mask:
nan, loss: nan
2020-05-15 14:57:24,852 - mmdet - INFO - Epoch [1][750/2500] lr: 0.00250, eta: 1 day, 14:39:52, time: 1.860, data_time: 1.123, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 10.3004, loss_bbox: nan, loss_mask:
nan, loss: nan
2020-05-15 14:58:57,577 - mmdet - INFO - Epoch [1][800/2500] lr: 0.00250, eta: 1 day, 14:36:45, time: 1.854, data_time: 1.113, memory: 6226, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_cls: nan, acc: 16.2500, loss_bbox: nan, loss_mask:
nan, loss: nan
I faced a similar issue when training MS-RCNN on COCO on a single GPU.
But the model trains without producing NaN when using multiple GPUs on a single machine.
Thank you @deepakksingh for your comment, I hadn't considered that avenue of investigation.
I will try running on dual GPUs as it's on an Azure VM, so simple to scale.
Thanks.
@deepakksingh I tried running it in distributed mode, on 2 GPUs on a single node, and I unfortunately encountered the same problem.
Hello @ecm200, have you tried modifying the hyperparameters like learning rate and such?
Recently they have added https://github.com/open-mmlab/mmdetection/blob/master/docs/tutorials/new_dataset.md . That may be helpful to you.
@deepakksingh thanks for the suggestions.
I haven't played around too much with the learning rate and other hyperparameters yet. I've been careful to make sure that my learning rate follows the "linear scaling rule" with mini-batch size, as I have mostly been working with a single GPU and 2 images per mini-batch. Thus I've scaled the learning rate accordingly (1/8 of the default).
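For readers following along, the scaling described here can be written as a small helper. This is a minimal sketch of the same arithmetic the training script applies, assuming the reference mini-batch of 16 (8 GPUs with 2 images each) stated in the script's comments; the function name `scale_lr` is hypothetical.

```python
def scale_lr(base_lr, num_gpus, imgs_per_gpu, base_batch_size=16):
    """Scale base_lr linearly with the effective mini-batch size.

    base_batch_size=16 assumes the reference setup of 8 GPUs x 2 images.
    """
    if num_gpus < 1 or imgs_per_gpu < 1:
        raise ValueError("num_gpus and imgs_per_gpu must be positive")
    effective_batch = num_gpus * imgs_per_gpu
    return base_lr * effective_batch / base_batch_size


# One GPU with 2 images per GPU gives 1/8 of the default lr of 0.02.
single_gpu_lr = scale_lr(0.02, num_gpus=1, imgs_per_gpu=2)
print(single_gpu_lr)
```

With the thread's settings this gives 0.0025, which matches the learning rate the log reports once warmup has finished.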
I have seen that tutorial, and noted from it the classes argument, which I had previously missed.
I am using pre-trained models, perhaps I should start from randomized parameters instead?
I'm new to the mmdetection framework too, and I'm still figuring things out.
There's no harm in giving the randomized parameters approach a try.
I think my issues stemmed from a few bugs in my conversion to the COCO dataset format. I am using synthetic data to obtain enough data to train the network so that it generalizes well to our small set of real data. To make the data as realistic as possible, and to increase the number of training images at a lower computational cost, I have implemented a bespoke augmentation workflow. This includes random translations of the simulation images, and the code did not ensure that all bounding boxes stayed within the image frame, so part of a bounding box could end up outside it. This appears to have caused the issues. The images are rectangular, and thus some of the objects are rotated completely or partially out of the image frame. I was dealing adequately with the polygons, and the bounding boxes were handled correctly when the object was completely out of frame. However, for objects that were partially rotated out of the frame, the polygon vertices were being deleted but the bounding boxes were not being modified. I have now made sure to squeeze the bounding boxes into the image frame, and training now appears to return finite loss values.
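The "squeeze the bounding boxes into the image frame" step could look roughly like the following. This is a minimal sketch, not the poster's actual code: it assumes COCO-style `[x, y, w, h]` boxes in pixels, and the function name `clip_bbox_xywh` and the `min_size` threshold are hypothetical. It handles boxes only; polygons would need their own clipping.

```python
def clip_bbox_xywh(bbox, img_width, img_height, min_size=1.0):
    """Clip a COCO [x, y, w, h] box to the image frame.

    Returns the clipped box, or None if less than min_size pixels of
    width or height remain inside the image (hypothetical threshold).
    """
    x, y, w, h = (float(v) for v in bbox)
    # Convert to corner coordinates and clamp each corner to the frame.
    x1 = min(max(x, 0.0), float(img_width))
    y1 = min(max(y, 0.0), float(img_height))
    x2 = min(max(x + w, 0.0), float(img_width))
    y2 = min(max(y + h, 0.0), float(img_height))
    new_w, new_h = x2 - x1, y2 - y1
    if new_w < min_size or new_h < min_size:
        return None  # box is (almost) entirely outside the image
    return [x1, y1, new_w, new_h]


# A box hanging off the left edge of a 1296 x 972 image is trimmed.
partly_outside = clip_bbox_xywh([-10, 20, 50, 40], 1296, 972)
# A box entirely past the right edge is dropped.
fully_outside = clip_bbox_xywh([1300, 10, 20, 20], 1296, 972)
```

Returning `None` for boxes that end up empty lets the caller drop the whole annotation, so no zero-area box reaches the training pipeline.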
Some general suggestions to deal with such NaN losses:
- check if the dataset annotations are correct
- reduce the learning rate
- extend the warmup iterations
- add gradient clipping
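For the first suggestion, a quick scan of the annotation file can flag suspicious boxes before training starts. This is a minimal sketch, assuming a standard COCO-style dictionary with `images` (each with `id`, `width`, `height`) and `annotations` (each with `id`, `image_id`, `bbox` as `[x, y, w, h]`); the function name `find_bad_boxes` is hypothetical.

```python
import math


def find_bad_boxes(coco):
    """Return (annotation_id, reason) pairs for suspicious boxes in a COCO dict."""
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    problems = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        if not all(math.isfinite(v) for v in (x, y, w, h)):
            problems.append((ann["id"], "non-finite value"))
            continue
        if w <= 0 or h <= 0:
            problems.append((ann["id"], "non-positive width or height"))
            continue
        if ann["image_id"] not in sizes:
            problems.append((ann["id"], "unknown image_id"))
            continue
        img_w, img_h = sizes[ann["image_id"]]
        if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            problems.append((ann["id"], "extends outside image"))
    return problems
```

To run it on a file, load the JSON with `json.load` and pass the resulting dictionary in; an empty list means none of these particular checks fired, not that the annotations are guaranteed correct.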
In the annotations, what is the format of bounding boxes? Is it x,y,X,Y or x,y,w,h ? @hellock
@sizhky In COCO annotations, it is x, y, w, h.
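Since the two conventions are easy to mix up, here is a minimal sketch of converting between them. The helper names `xywh_to_xyxy` and `xyxy_to_xywh` are hypothetical and not part of any library.

```python
def xywh_to_xyxy(bbox):
    """Convert [x, y, w, h] (top-left corner plus size) to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def xyxy_to_xywh(bbox):
    """Convert [x1, y1, x2, y2] (two corners) to [x, y, w, h]."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2 - x1, y2 - y1]


corners = xywh_to_xyxy([10, 20, 30, 40])
round_trip = xyxy_to_xywh(corners)
```

Writing corner coordinates into a field that expects width and height produces oversized boxes, which is one of the annotation errors worth checking for.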
Try reducing the learning rate by a factor of 100 or more.
Some general suggestions to deal with such NaN losses:
- check if the dataset annotations are correct
- reduce the learning rate
- extend the warmup iterations
- add gradient clipping
The third method is very useful to me.
Can you please tell how to implement the third method?
Can you please tell how to implement the third method?
Maybe it is in schedule_1x.py; try changing warmup_ratio.
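For reference, the learning rate, warmup, and gradient clipping suggestions map onto config overrides roughly like this. This is a minimal sketch, assuming an MMDetection 2.x-style config that inherits schedule_1x.py through `_base_`; the specific values (`lr=0.002`, `max_norm=35`, `warmup_iters=2000`) are illustrative assumptions, not tuned recommendations. In that style of schedule, `warmup_iters` sets how long the warmup lasts, while `warmup_ratio` sets the fraction of the base learning rate it starts from, so extending the warmup means raising `warmup_iters`. The log earlier in the thread is consistent with a 500-iteration warmup: lr rises by 0.00025 every 50 iterations and stays at 0.0025 from iteration 500.

```python
# Overrides for the custom config file; values are illustrative assumptions.

# Lower base learning rate (the thread's config uses lr=0.02 before the
# training script applies its own linear scaling).
optimizer = dict(type='SGD', lr=0.002, momentum=0.9, weight_decay=0.0001)

# Clip gradients by their global L2 norm before each optimizer step.
# _delete_=True replaces the inherited optimizer_config instead of merging.
optimizer_config = dict(
    _delete_=True, grad_clip=dict(max_norm=35, norm_type=2))

# Longer linear warmup; keys not listed here are inherited from the base
# schedule when the config is loaded through _base_.
lr_config = dict(warmup='linear', warmup_iters=2000, warmup_ratio=0.001)
```

Because the training script in this thread rescales `cfg.optimizer['lr']` by the mini-batch size, the value that actually reaches the optimizer on one GPU with 2 images would be 1/8 of whatever is set here.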