titu1994/keras-adabound

about lr


Thanks for a good optimizer.
According to the usage example:

optm = AdaBound(lr=1e-03,
                final_lr=0.1,
                gamma=1e-03,
                weight_decay=0.,
                amsbound=False)

Does the learning rate gradually increase with the number of steps?


final_lr is described as "Final learning rate", but isn't it actually a learning rate computed relative to the base lr and the current learning rate?

final_lr = self.final_lr * lr / self.base_lr
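
For illustration, here is a minimal numeric sketch of what that line implies; the base lr, current lr, and final_lr values below are made-up assumptions, not taken from the repo:

# Hypothetical numbers to illustrate the rescaling above (not from the repo).
base_lr = 1e-3      # lr the optimizer was constructed with
final_lr = 0.1      # final_lr passed to the constructor
lr = 5e-4           # current lr, e.g. after an external schedule halved it

# The bound target is scaled by the same fraction the base lr has been
# scaled by, so external lr schedules also shrink the SGD-like final lr.
effective_final_lr = final_lr * lr / base_lr
print(effective_final_lr)   # 0.05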

The final lr is approximately reached after 1/gamma update steps have occurred. At this point the clipping bounds are somewhat tight and cause the actual lr to fall close to the final lr after clipping.

In the initial updates, though, the LR bounds are in the range of the initial lr, so it allows Adam-type updates.
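
As a rough sketch of that behaviour, the snippet below evaluates the bound schedule used by AdaBound (the lower and upper clipping bounds as functions of the step count); the final_lr, gamma, and step values here are illustrative assumptions:

# Illustrative sketch of how AdaBound's clipping bounds evolve with the
# step count, following the bound schedule from the AdaBound paper.
# final_lr, gamma, and the step values are assumptions for illustration.
final_lr = 0.1
gamma = 1e-3   # 1/gamma = 1000 steps

for step in (1, 10, 100, 1000, 10000):
    lower = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))  # starts near 0
    upper = final_lr * (1.0 + 1.0 / (gamma * step))        # starts very large
    print(step, round(lower, 4), round(upper, 4))

# Early on the interval [lower, upper] is so wide that the Adam step passes
# through almost unchanged; around 1/gamma steps the bounds have tightened
# noticeably around final_lr, and they keep converging towards it, so later
# updates behave like SGD with lr close to final_lr.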

This means that if you use this optimizer on a dataset or task that SGD doesn't do well on (but Adam does), it will get worse results than Adam alone. At least that's what I've experienced on language modelling tasks.