On mixed precision training.
huangh12 commented
Hello,
I've noticed that for mixed precision training you scale up the loss by cfg.TRAIN.scale and then scale down the learning_rate and warmup_lr (while scaling up the weight decay). But the common practice seems to be that simply setting rescale_grad to 1.0/scale is enough. Could you please explain this? Thanks!
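For reference, the common practice I have in mind looks roughly like this (just a sketch reusing your config names and assuming MXNet's standard Optimizer parameters, where rescale_grad is multiplied into every gradient before the update; this is my assumption of the usual setup, not your code):

```python
# Sketch of the usual loss-scaling setup: gradients come from a loss that
# was multiplied by cfg.TRAIN.scale, and rescale_grad undoes that scaling,
# so lr and wd can stay at their fp32 values.
optim_params = {'momentum': cfg.TRAIN.momentum,
                'wd': cfg.TRAIN.wd,                     # unchanged
                'learning_rate': base_lr,               # unchanged
                'rescale_grad': 1.0 / cfg.TRAIN.scale,  # undo loss scaling here
                'multi_precision': True,
                'clip_gradient': None,
                'lr_scheduler': lr_scheduler}
```

whereas get_optim_params in this repo does: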
```python
def get_optim_params(cfg, roidb_len, batch_size):
    # Create scheduler
    base_lr = cfg.TRAIN.lr
    lr_step = cfg.TRAIN.lr_step
    lr_factor = cfg.TRAIN.lr_factor
    begin_epoch = cfg.TRAIN.begin_epoch
    lr_epoch = [float(epoch) for epoch in lr_step.split(',')]
    lr_epoch_diff = [epoch - begin_epoch for epoch in lr_epoch if epoch > begin_epoch]
    lr_iters = [int(epoch * roidb_len / batch_size) for epoch in lr_epoch_diff]
    if cfg.TRAIN.fp16:
        cfg.TRAIN.warmup_lr /= cfg.TRAIN.scale  # warmup lr scaled DOWN (in place)
    lr_scheduler = WarmupMultiBatchScheduler(lr_iters, lr_factor, cfg.TRAIN.warmup,
                                             cfg.TRAIN.warmup_lr, cfg.TRAIN.warmup_step)
    if cfg.TRAIN.fp16:
        optim_params = {'momentum': cfg.TRAIN.momentum,
                        'wd': cfg.TRAIN.wd * cfg.TRAIN.scale,        # weight decay scaled UP
                        'learning_rate': base_lr / cfg.TRAIN.scale,  # lr scaled DOWN
                        'rescale_grad': 1.0,                         # gradients left as-is
                        'multi_precision': True,
                        'clip_gradient': None,
                        'lr_scheduler': lr_scheduler}
    else:
        optim_params = {'momentum': cfg.TRAIN.momentum,
                        'wd': cfg.TRAIN.wd,
                        'learning_rate': base_lr,
                        'rescale_grad': 1.0,
                        'clip_gradient': None,
                        'lr_scheduler': lr_scheduler}
    return optim_params
```
By the way, it's somewhat hard to understand why you scale up the weight decay. Shouldn't it be scaled down as well (the same as lr and warmup_lr)?
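For what it's worth, here is a quick numeric check I put together (my own sketch, assuming plain SGD with L2 weight decay and no momentum, where MXNet's update is roughly w -= lr * (rescale_grad * grad + wd * w)); it suggests the three rescalings cancel out exactly, but I'd still appreciate confirmation:

```python
# Toy check: does (loss * S, lr / S, wd * S, rescale_grad = 1) reproduce the
# unscaled fp32 update? Assumes plain SGD without momentum.
S, lr, wd = 128.0, 0.01, 1e-4   # loss scale, base lr, base weight decay
w, g = 0.5, 0.2                 # toy weight and its TRUE (unscaled) gradient

# Baseline fp32 update, no loss scaling:
baseline = w - lr * (g + wd * w)

# fp16 path from the snippet above: the backward pass sees S * loss, so the
# raw gradient is S * g; lr is divided by S and wd multiplied by S:
fp16_style = w - (lr / S) * (1.0 * (S * g) + (wd * S) * w)

print(abs(baseline - fp16_style) < 1e-12)  # True: the updates coincide
```

If that algebra is right, the wd term sits inside the lr-scaled update, so dividing lr by S would shrink the decay too unless wd is multiplied back up, which would explain the asymmetry, but I may be missing something.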