Learning rate warm-up
jessejchuang opened this issue · 1 comment
I quote one paragraph from Kaiming He's paper: "We further explore n = 18 that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging. So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle)."
In my experiments, I did not see the training error exceed 80% in the early epochs, so I did not use this warm-up scheme. However, ResNet-110 did converge more slowly than ResNet-56 at the start; only around epoch 18 did ResNet-110's training error overtake ResNet-56's. My question is: should we apply the warm-up scheme so that ResNet-110 converges faster in the earlier epochs?
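For reference, a minimal sketch of the warm-up described in the quoted paragraph, assuming a standard PyTorch SGD training loop; the model and data loader below are dummy placeholders rather than the actual ResNet-110/CIFAR-10 setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins: replace with the real ResNet-110 and CIFAR-10 loader.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
train_loader = [(torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,)))
                for _ in range(10)]

base_lr, warmup_lr = 0.1, 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=warmup_lr,
                            momentum=0.9, weight_decay=1e-4)
warming_up = True

for epoch in range(200):
    for images, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = F.cross_entropy(outputs, targets)
        loss.backward()
        optimizer.step()

        # Training error on the current mini-batch.
        train_err = 1.0 - (outputs.argmax(dim=1) == targets).float().mean().item()

        # Once the training error drops below 80%, switch back to the base
        # learning rate (the paper reports this takes about 400 iterations).
        if warming_up and train_err < 0.80:
            for group in optimizer.param_groups:
                group["lr"] = base_lr
            warming_up = False
    # ...the usual step decay (e.g. dividing the LR by 10 at the scheduled
    # milestones) would follow here, as in the paper's original schedule.
```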
Hi,
We do not need to apply warm-up.
Training with or without warm-up did not affect the final result in the PyTorch implementation.
Jia-Ren