Epsilon is important to adaptive optimizers
Hi~
#18 (comment)
Since I asked you a question last time, I've run a series of experiments. I think both ways of determining the descent step size are plausible, whether based on the variance of the gradient or on the square of the gradient. I found that if epsilon's position is changed, results similar to AdaBelief can be achieved. I did some experiments and analysis and put them in https://github.com/yuanwei2019/EAdam-optimizer
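To make the difference in epsilon's position concrete, here is a minimal single-parameter NumPy sketch of the two placements (the function names and the default hyperparameters are just for illustration; this is not the exact code in either repo):

```python
import numpy as np

def adam_step(m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam: eps is added once, outside the sqrt of the bias-corrected v.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    return m, v, step

def eadam_like_step(m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # EAdam-style placement: eps is folded into the accumulator at every step,
    # so in steady state it acts like a floor of roughly eps / (1 - beta2).
    # (AdaBelief also adds eps inside its accumulator, but of (g - m) ** 2.)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2 + eps
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / np.sqrt(v_hat)
    return m, v, step
```

Because the accumulated eps is amplified to roughly eps / (1 - beta2), the second version has a much larger effective lower bound on the denominator, which is one way to read why it behaves closer to AdaBelief than vanilla Adam does.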
Thanks a lot for the nice experiments. This could point to a new direction I have not extensively explored.
It’s possible that the gradient has a mean very close to 0 (perhaps batchnorm does some centralization of the gradient); in that case the second moment is dominated by the variance, and both ways of treating eps are similar (a quick numerical check of this is sketched below, after the third point). Could you try EAdam on SN-GAN? Perhaps the gradient in that case does not have zero mean (I’m not sure, just a guess).
It’s also possible that eps is large compared to g_t^2, so the denominator is dominated by eps.
The third possible reason is that s_t and v_t are truly bounded below after adding eps, which matches the lower-bound assumption in the theoretical proof.
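A quick numerical check of the first point (with synthetic gradient samples, purely for illustration): since E[g^2] = Var(g) + E[g]^2, the g^2-based accumulator and the (g - m)^2-based one only separate when the gradient mean is far from zero.

```python
import numpy as np

rng = np.random.default_rng(0)

for mean in (0.0, 0.5):
    # Synthetic per-coordinate gradient samples with a controlled mean.
    g = rng.normal(loc=mean, scale=0.1, size=100_000)
    second_moment = np.mean(g ** 2)           # what Adam/EAdam accumulate
    variance = np.mean((g - g.mean()) ** 2)   # roughly what AdaBelief accumulates
    print(f"mean={mean}: E[g^2]={second_moment:.4f}, Var(g)={variance:.4f}")
```

With zero-mean gradients the two quantities are nearly identical, so the placement of eps becomes the main remaining difference; with a clearly nonzero mean (the SN-GAN guess above) they could diverge.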
Thanks for the nice experiments. This could be an important supplement to the paper. Due to limited GPU resources, I was not able to run on large datasets such as COCO, so it’s very nice that you reported new results. Good to see that AdaBelief and EAdam outperform the others in more experiments.