Tony-Y/pytorch_warmup

A BUG in BaseWarmup?

Closed this issue · 8 comments

I monitor the learning rate during training, and I was surprised to find that the steady LR after warmup depends on the number of warmup steps. Looking into it, I discovered that there is a dampen operation in the init function of BaseWarmup, which permanently changes the LR inside the original optimizer (to lr / warmup_steps). This logic is strange and confusing, and I wonder if it is a bug. Maybe the dampen operation in the init function should be removed?
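For example (a minimal sketch of what I observe; the SGD optimizer, the LR of 0.1, and warmup_period=1000 are made up for illustration):

import torch
import pytorch_warmup as warmup

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
print(optimizer.param_groups[0]['lr'])  # 0.1, the original LR

# merely constructing the warmup object already dampens the optimizer's LR
warmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period=1000)
print(optimizer.param_groups[0]['lr'])  # now roughly 0.1 / 1000, no longer the original LR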

I initialize the LR scheduler after the warmup object, so because the warmup changes the LR in its init function, the scheduler records a wrong initial LR. This then causes a problem when I call:

with warmup.dampening():
    scheduler.step()

scheduler.step() then causes self.lrs in the warmup object to become wrong.
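Schematically, my setup looks like this (the optimizer, the scheduler type, and the constants are only placeholders):

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# warmup constructed first: the optimizer's LR is already dampened here
warmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period=1000)

# the scheduler then records the dampened LR as its base_lr,
# so every later scheduler.step() works from the wrong base value
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)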

The initial LR must be dampened inside the init function. Do not initialize the scheduler after the warmup.
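That is, construct the LR scheduler first and the warmup object last, and step the scheduler inside the dampening context. A rough sketch (the optimizer, scheduler, and training-loop details are only placeholders):

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
warmup_scheduler = warmup.LinearWarmup(optimizer, warmup_period=1000)  # constructed last

for batch in dataloader:           # placeholder data loader
    optimizer.zero_grad()
    loss = compute_loss(batch)     # placeholder loss computation
    loss.backward()
    optimizer.step()
    with warmup_scheduler.dampening():
        lr_scheduler.step()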

Sorry, I don't understand. The LR will be dampened as soon as the dampening function is called for the first time, so why should we dampen it inside the init function?

This follows the same approach as the PyTorch LR scheduler, which calls self.step() inside its init function:
https://github.com/pytorch/pytorch/blob/db393fb95e5b057ca49472828bb6dba2db4f859e/torch/optim/lr_scheduler.py#L146-L151
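For instance, a plain PyTorch scheduler already applies the step-0 LR when it is constructed (a small sketch; the constant factor 0.5 is arbitrary):

import torch

optimizer = torch.optim.SGD(torch.nn.Linear(10, 1).parameters(), lr=0.1)

# LambdaLR calls self.step() inside __init__, so the step-0 factor is applied
# before any explicit scheduler.step()
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda step: 0.5)
print(optimizer.param_groups[0]['lr'])  # 0.05, not 0.1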

In PyTorch 1.0 or earlier, scheduler.step() had to be called before optimizer.step():
http://pytorch.org/docs/1.0.0/optim.html#how-to-adjust-learning-rate
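Under that old convention, the training loop looked roughly like this (train_one_epoch, num_epochs, and the other names are placeholders):

# PyTorch 1.0 and earlier: the scheduler was stepped *before* the optimizer,
# so the step-0 LR was already in effect before the first optimizer.step()
for epoch in range(num_epochs):
    scheduler.step()
    train_one_epoch(model, optimizer)  # optimizer.step() happens inside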

Refresh your thought.

So the only difference between the two approaches (dampening the LR in the init function or not) is whether the optimizer uses the original LR or the dampened LR at step 0? In the following steps, the optimizer would adopt the same LR under both approaches. Am I right?

The LR index differs by one step, so the LR differs throughout the warmup, not only at step 0.
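Concretely, assuming a linear warmup factor of (step + 1) / warmup_period (which matches the lr / warmup_steps value observed above) and a hypothetical 5-step warmup with a base LR of 0.1, the two variants produce the same sequence shifted by one step:

base_lr, period = 0.1, 5

# dampen in __init__ : step 0 uses base_lr * 1/5, step 1 uses 2/5, ...
with_init_dampening = [base_lr * min(1.0, (t + 1) / period) for t in range(7)]

# dampen only on the first dampening() call: step 0 uses the full base_lr,
# and the warmup factors start one step later
without_init_dampening = [base_lr] + with_init_dampening[:-1]

print(with_init_dampening)     # ~[0.02, 0.04, 0.06, 0.08, 0.1, 0.1, 0.1]
print(without_init_dampening)  # ~[0.1, 0.02, 0.04, 0.06, 0.08, 0.1, 0.1]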

I understand, thank you for your answer and contributions.

Maybe the documentation could include a reminder that the scheduler should be initialized before the warmup.

Thank you for your suggestion. I have just updated the README.