Question: How similar or dissimilar is this compared to Hypergradient Descent?
muellerzr opened this issue · 2 comments
Congratulations on the new optimizer, excited to try it out! In our fastai discussion group it was brought up that your optimizer seems similar to HGD. Would you be able to summarize the key differences between the two for me/us? :) Thank you in advance!
Thanks for the comments, very interesting work.
The very general idea is similar for Adam, AdaBelief and HGD I guess: adapt the stepsize for each element. The key difference is how to adapt it: Adam scales by a smoothed version of gradient^2, AdaBelief scales by a smoothed version of (change in gradient)^2, and the reddit discussion points to "diffGrad", which scales by sigmoid(change in gradient); HGD is different, it directly updates the lr by gradient descent, using the gradient of the loss w.r.t. the lr (for each element).
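To make the contrast concrete, here is a minimal numpy sketch of the per-element scaling each method uses; it is only illustrative (bias correction, epsilon placement, and the state/variable names like `m`, `v`, `s` are simplifications/assumptions, not the actual library implementations):

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8

def adam_step(g, m, v):
    # Adam: scale by a smoothed (EMA) version of gradient^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    return m / (np.sqrt(v) + eps), m, v

def adabelief_step(g, m, s):
    # AdaBelief: scale by a smoothed version of (g - m)^2,
    # i.e. how much the gradient deviates from its EMA ("belief")
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * (g - m)**2
    return m / (np.sqrt(s) + eps), m, s

def diffgrad_factor(g, g_prev):
    # diffGrad: additionally multiply the step by sigmoid(|change in gradient|)
    return 1.0 / (1.0 + np.exp(-np.abs(g - g_prev)))
```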
I'm somewhat concerned about HGD, because it's equivalent to adding an extra N parameters, where N is the dimension of the network parameters, and performing gradient descent on the 2N parameters. Training is on 2N params, though inference is on N parameters. Not sure if this is a big issue, but HGD is still a very interesting direction.
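A rough sketch of that "2N parameters" point, assuming the per-element hypergradient variant (the names `theta`, `lr`, `beta_hyper` are just placeholders for illustration, not HGD's actual code):

```python
import numpy as np

def hgd_step(theta, lr, g, g_prev, beta_hyper=1e-4):
    # hypergradient of the loss w.r.t. the per-element learning rate
    h = -g * g_prev
    lr = lr - beta_hyper * h    # gradient step on the N learning rates
    theta = theta - lr * g      # gradient step on the N weights
    return theta, lr            # training updates 2N values, inference only needs N
```

So the learning rates themselves are trained by gradient descent alongside the weights, which is the part that doubles the number of updated quantities.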
Please correct me if you have other comments. BTW, is the fastai discussion group available online? How can I join it?
Awesome, thank you very much! And yes, fastai has an open Discord :) We're talking about this in the #chitchat channel; I'll send you an email with the invite in a moment (I saw you posted your email in the Discord).