Bug in the per-coordinate clipping?
vmarkovtsev opened this issue · 2 comments
The current code for per-coordinate clipping is this:
ratio = (exp_avg.abs() / (rho * bs * hess + 1e-15)).clamp(None,1)
param.addcmul_(exp_avg.sign(), ratio, value=step_size_neg)
Let's suppose that hess
carries negative values below -1e-15 / (rho * bs)
. In that case, exp_avg.abs() / (rho * bs * hess + 1e-15)
evaluates to a negative number, potentially a very big one. .clamp(None,1)
does not truncate the negative numbers, so they survive. exp_avg.sign() * ratio * step_size_neg
becomes either an unclipped positive (if exp_avg is negative) or an unclipped negative.
In my experiments, hess
has always been positive so this problem has never appeared. The paper, however, says that:
When any entry of ht is negative, ... = ηρ·sign(mt[i]), which is the same as stochastic momentum SignSGD.
So the negative hessian values should be clipped. WDYT?
hess
cannot be negative in SophiaG since it is the diagonal entries of the Gauss-Newton matrix, which has to be non-negative.
In the paper,
When any entry of ht is negative, ... = ηρ·sign(mt[i]), which is the same as stochastic momentum SignSGD.
deals with the potential negative entries of hess
for SophiaH. The Hessian estimator in SophiaH is estimating the diagonal of the actual Hessian matrix so it can contain some negative values.
Thanks for the explanation!