mahyarnajibi/SNIPER

On mixed precision training.


Hello,
I've noticed that you scale up the loss by cfg.TRAIN.scale and, for mixed precision training, scale down the learning_rate and warmup_lr (and scale up the weight decay). But the common practice seems to be that just scaling down rescale_grad is enough (see the sketch after the code below for what I mean). Could you please explain that? Thanks!

def get_optim_params(cfg,roidb_len,batch_size):
    # Create scheduler
    base_lr = cfg.TRAIN.lr
    lr_step = cfg.TRAIN.lr_step
    lr_factor = cfg.TRAIN.lr_factor
    begin_epoch = cfg.TRAIN.begin_epoch
    lr_epoch = [float(epoch) for epoch in lr_step.split(',')]
    lr_epoch_diff = [epoch - begin_epoch for epoch in lr_epoch if epoch > begin_epoch]
    lr_iters = [int(epoch * roidb_len / batch_size) for epoch in lr_epoch_diff]
    if cfg.TRAIN.fp16:
        # fp16: the loss is multiplied by cfg.TRAIN.scale, so warmup_lr is divided by it
        cfg.TRAIN.warmup_lr /= cfg.TRAIN.scale
    lr_scheduler = WarmupMultiBatchScheduler(lr_iters, lr_factor, cfg.TRAIN.warmup, cfg.TRAIN.warmup_lr, cfg.TRAIN.warmup_step)

    if cfg.TRAIN.fp16 == True:
        # fp16: learning_rate is divided by the loss scale, wd is multiplied by it,
        # and rescale_grad is left at 1.0
        optim_params = {'momentum': cfg.TRAIN.momentum,
                        'wd': cfg.TRAIN.wd*cfg.TRAIN.scale,
                        'learning_rate': base_lr/cfg.TRAIN.scale,
                        'rescale_grad': 1.0,
                        'multi_precision': True,
                        'clip_gradient': None,
                        'lr_scheduler': lr_scheduler}
    else:
        optim_params = {'momentum': cfg.TRAIN.momentum,
                        'wd': cfg.TRAIN.wd,
                        'learning_rate': base_lr,
                        'rescale_grad': 1.0,
                        'clip_gradient': None,
                        'lr_scheduler': lr_scheduler}

    return optim_params
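
For reference, this is roughly what I mean by the common practice. It is only my own sketch (not code from this repo), assuming MXNet's standard SGD optimizer options: keep base_lr, wd and warmup_lr unchanged and fold the loss scale into rescale_grad instead.

# My own sketch of the alternative, not code from this repo:
# the loss is still multiplied by cfg.TRAIN.scale in the loss symbol,
# but the gradients are scaled back via rescale_grad.
def get_optim_params_rescale_only(cfg, lr_scheduler):
    scale = cfg.TRAIN.scale if cfg.TRAIN.fp16 else 1.0
    return {'momentum': cfg.TRAIN.momentum,
            'wd': cfg.TRAIN.wd,                # unchanged
            'learning_rate': cfg.TRAIN.lr,     # unchanged
            'rescale_grad': 1.0 / scale,       # undo the loss scaling here instead
            'multi_precision': cfg.TRAIN.fp16,
            'clip_gradient': None,
            'lr_scheduler': lr_scheduler}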

btw, it's somewhat hard to understand why you scale up the weight decay. Shouldn't it be scaled down as well (the same as lr and warmup_lr)?
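
To make the question concrete, this is the arithmetic I'm comparing (my own sanity check, ignoring momentum and gradient clipping, and assuming MXNet's documented SGD step weight -= lr * (rescale_grad * grad + wd * weight)); the numbers below are hypothetical, only for illustration:

# My own sanity check, not from the repo: compare the effective SGD step under the
# three settings, assuming weight -= lr * (rescale_grad * grad + wd * weight) and
# that scaling the loss by S scales the back-propagated gradient to S * grad_true.
S, lr, wd, w, grad_true = 128.0, 0.01, 1e-4, 0.5, 0.2   # hypothetical values
grad_scaled = S * grad_true

fp32_reference = lr * (1.0 * grad_true + wd * w)
repo_fp16 = (lr / S) * (1.0 * grad_scaled + (wd * S) * w)   # lr/S, wd*S, rescale_grad=1.0
rescale_only = lr * ((1.0 / S) * grad_scaled + wd * w)      # rescale_grad=1/S, lr/wd unchanged

print(fp32_reference, repo_fp16, rescale_only)   # the same step size in all three cases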