Chapter 10 Adagrad text.
avs20 opened this issue · 0 comments
avs20 commented
In chapter 10, after the square root example, the text reads:

> Overall, the impact is the learning rates for parameters with smaller gradients are decreased slowly, while the parameters with larger gradients have their learning rates decreased faster.
I am confused by this: we are not updating the learning rate anywhere (other than the learning rate decay). Yes, the weights of parameters with larger gradients will be updated faster, but much more slowly than they would be if no normalization were used.
Am I correct, or am I understanding this incorrectly?
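For reference, this is how I read the update (a minimal sketch assuming the NumPy style used in the book; the variable names and the toy gradient values are my own, not the book's `Optimizer_Adagrad` code). The global `learning_rate` never changes; what changes per parameter is the factor `learning_rate / (sqrt(cache) + epsilon)` that multiplies each gradient:

```python
# Minimal Adagrad-style update sketch (my own illustration, not the book's exact code).
import numpy as np

learning_rate = 1.0
epsilon = 1e-7

# Two parameters: one receiving small gradients, one receiving large gradients.
params = np.array([0.0, 0.0])
cache = np.zeros_like(params)          # per-parameter sum of squared gradients

for step in range(1, 4):
    grads = np.array([0.1, 10.0])      # small vs. large gradient
    cache += grads ** 2                # the large-gradient cache grows much faster
    # The factor that actually multiplies each gradient this step:
    effective_lr = learning_rate / (np.sqrt(cache) + epsilon)
    params += -effective_lr * grads
    print(f"step {step}: effective lr per parameter = {effective_lr}")
```

As I understand it, that per-parameter factor is what the quoted text calls the parameter's "learning rate": the cache for the large-gradient parameter grows faster, so its gradient gets divided by a larger number, even though `learning_rate` itself stays fixed.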