Why does g_t subtract m_t, instead of m_{t-1}?
zxteloiv opened this issue · 1 comment
Dear authors,
Thanks for providing such a good implementation; I have benefited a lot from this repo in my experiments.
I have a question about the update of s_t in the algorithm, as in the title.
In my task, (g_t - m_t)^2 gives a contradictory result compared with (g_t - m_{t-1})^2 regarding the choice of betas. Specifically, the original update (g_t - m_t)^2 suggests a larger beta2 is better (0.999 rather than 0.98), while the revised version (g_t - m_{t-1})^2 shows that 0.98 is the better beta2.
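For concreteness, here is a minimal scalar sketch (my own code, not from the repo; variable names are mine) of the two second-moment updates I am comparing:

```python
import random

# Scalar sketch of the two second-moment updates being compared.
# beta1/beta2 play the usual Adam/AdaBelief roles; gradients are plain floats for clarity.
def run_variants(grads, beta1=0.9, beta2=0.999):
    m = 0.0        # first-moment EMA, m_t
    s_orig = 0.0   # EMA of (g_t - m_t)^2      (original update)
    s_rev = 0.0    # EMA of (g_t - m_{t-1})^2  (revised update)
    for g in grads:
        m_prev = m
        m = beta1 * m_prev + (1 - beta1) * g
        s_orig = beta2 * s_orig + (1 - beta2) * (g - m) ** 2
        s_rev = beta2 * s_rev + (1 - beta2) * (g - m_prev) ** 2
    return s_orig, s_rev

# Example: noisy gradients around a constant signal, with the two beta2 values I tried.
grads = [1.0 + random.gauss(0.0, 0.1) for _ in range(1000)]
print(run_variants(grads, beta2=0.999))
print(run_variants(grads, beta2=0.98))
```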
Other parameters are kept the same as the default. The code version I use is pytorch-0.2.0.
To name some of them, lr=1e-3, eps=1e-16, weight_decay=0.1, weight_decoupled=True, amsgrad=False, fixed_decay=False, rectify=True.
To compare with Adam and RAdam, I also tested with rectify set to False.
The contradiction still occurs between the original and revised updates of s_t (however, this time the better beta2 is reversed).
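For reference, this is roughly how I construct the optimizer (a sketch only; keyword names are as I recall them from the package README for adabelief-pytorch 0.2.0, and I believe the decoupled-weight-decay flag is spelled `weight_decouple` in the constructor — please correct me if any name differs):

```python
import torch
from adabelief_pytorch import AdaBelief

model = torch.nn.Linear(10, 2)  # placeholder model, just for illustration
optimizer = AdaBelief(
    model.parameters(),
    lr=1e-3,
    eps=1e-16,
    betas=(0.9, 0.999),   # beta2 is what I sweep: 0.999 vs 0.98
    weight_decay=0.1,
    weight_decouple=True,
    amsgrad=False,
    fixed_decay=False,
    rectify=True,         # set to False for the Adam/RAdam comparison
)
```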
I know this parameter tuning lacks sufficient evidence to draw a convincing conclusion, so I just wonder why (g_t - m_t)^2 is used.
Since (g_t - m_{t-1})^2 compares the gradient of the current step with the previous moving average, I would guess it is more intuitive.
Thanks for reading my question. Wish you a good day :)
Hi, thanks for your question. There is no specific reason for choosing g_t - m_t over g_t - m_{t-1}; I just picked one. In fact, given m_t = \beta m_{t-1} + (1 - \beta) g_t, we have g_t - m_t = \beta (g_t - m_{t-1}). As long as \beta is close to 1, there is not much difference. I think the difference will be larger when \beta is small, although I still have no idea which one is better. I do suspect the optimal choice of betas differs from the Adam defaults; however, it would take too much computation to find a choice that works well for most tasks, so I did not perform an extensive search.
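As a quick sanity check, here is a small numeric sketch (not the repo's code) of that identity; since the squared terms then differ by the constant factor \beta^2, the two s_t estimates mostly differ by a constant rescaling when \beta is fixed:

```python
import random

# Numeric check of the identity g_t - m_t = beta * (g_t - m_{t-1})
# for the EMA m_t = beta * m_{t-1} + (1 - beta) * g_t.
beta = 0.9
m_prev = 0.0
for t in range(1, 6):
    g = random.gauss(0.0, 1.0)              # a fake gradient sample
    m = beta * m_prev + (1 - beta) * g      # m_t
    lhs = g - m
    rhs = beta * (g - m_prev)
    print(f"t={t}: g_t - m_t = {lhs:+.6f}, beta*(g_t - m_(t-1)) = {rhs:+.6f}")
    m_prev = m
```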