Chapter 10 Adagrad text.
avs20 opened this issue · 0 comments
avs20 commented
In chapter 10, after the square root example, the text reads:

> Overall, the impact is the learning rates for parameters with smaller gradients are decreased slowly, while the parameters with larger gradients have their learning rates decreased faster.
I am confused by this: we are not updating the learning rate anywhere (other than the learning rate decay). Yes, the weights of parameters with larger gradients will be updated faster, but much more slowly than they would be if no normalization were used.
Am I correct, or am I understanding this incorrectly?
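For reference, this is how I read the update (a minimal sketch assuming the NumPy style used in the book; the variable names and the toy gradient values are my own, not the book's `Optimizer_Adagrad` code). The global `learning_rate` never changes; what changes per parameter is the factor `learning_rate / (sqrt(cache) + epsilon)` that multiplies each gradient:

```python
# Minimal Adagrad-style update sketch (my own illustration, not the book's exact code).
import numpy as np

learning_rate = 1.0
epsilon = 1e-7

# Two parameters: one receiving small gradients, one receiving large gradients.
params = np.array([0.0, 0.0])
cache = np.zeros_like(params)          # per-parameter sum of squared gradients

for step in range(1, 4):
    grads = np.array([0.1, 10.0])      # small vs. large gradient
    cache += grads ** 2                # the large-gradient cache grows much faster
    # The factor that actually multiplies each gradient this step:
    effective_lr = learning_rate / (np.sqrt(cache) + epsilon)
    params += -effective_lr * grads
    print(f"step {step}: effective lr per parameter = {effective_lr}")
```

As I understand it, that per-parameter factor is what the quoted text calls the parameter's "learning rate": the cache for the large-gradient parameter grows faster, so its gradient gets divided by a larger number, even though `learning_rate` itself stays fixed.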