Math pseudocode in description of SGD with Nesterov is incorrect
The math pseudocode in the description of SGD with Nesterov is currently given as:
I believe this is incorrect. Apart from the circular definition of `m_t` in the case when `nesterov = False`, the definition of `m_t` itself should be corrected. The correct set of equations should be:
This can be verified from equations (3) and (4) in Sutskever et al., "On the importance of initialization and momentum in deep learning", 2013, with the change of variables `m_t = -v_t/epsilon` and `alpha_t = epsilon`.
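For readers who want to check the change of variables numerically, here is a minimal sketch. It is not the docstring's pseudocode and not the library's implementation: the function names, the toy objective `f(theta) = 0.5 * theta**2`, and the hyperparameter values are assumptions made for illustration. It runs the velocity form of equations (3) and (4) from Sutskever et al. next to the form obtained by substituting `m_t = -v_t/epsilon` and `alpha_t = epsilon`, and confirms that both produce the same iterates.

```python
def grad(theta):
    # Gradient of the toy objective f(theta) = 0.5 * theta**2 (illustrative assumption).
    return theta


def nag_velocity_form(theta, steps, epsilon, mu):
    # Sutskever et al. (2013), eqs. (3)-(4):
    #   v_{t+1}     = mu * v_t - epsilon * grad(theta_t + mu * v_t)
    #   theta_{t+1} = theta_t + v_{t+1}
    v = 0.0
    for _ in range(steps):
        v = mu * v - epsilon * grad(theta + mu * v)
        theta = theta + v
    return theta, v


def nag_momentum_form(theta, steps, alpha, mu):
    # Same recursion after substituting m_t = -v_t / epsilon and alpha = epsilon:
    #   m_{t+1}     = mu * m_t + grad(theta_t - alpha * mu * m_t)
    #   theta_{t+1} = theta_t - alpha * m_{t+1}
    m = 0.0
    for _ in range(steps):
        m = mu * m + grad(theta - alpha * mu * m)
        theta = theta - alpha * m
    return theta, m


if __name__ == "__main__":
    epsilon, mu = 0.1, 0.9
    theta_v, v = nag_velocity_form(1.0, 10, epsilon, mu)
    theta_m, m = nag_momentum_form(1.0, 10, epsilon, mu)
    # Both forms should agree up to floating-point rounding.
    print(abs(theta_v - theta_m) < 1e-9, abs(m + v / epsilon) < 1e-9)
```

Note that in this sketch the gradient is evaluated at the look-ahead point, as in the paper; a docstring may instead express the update in terms of the gradient at the current parameters, which is a further reparameterization not covered here.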
Oh right, thank you very much for catching that @satyenkale! Could you make the correction with a quick PR?
Thanks! I created a PR (#901) but I am not sure if all the checks went through.
Thank you again! It should go through. The check failures in #901 seem to be related to some changes in JAX that broke some of our code. We'll investigate that.