lisa-lab/pylearn2

AdaGrad rule yields NaN on first iteration

Closed this issue · 9 comments

I'm running simple tests on small networks for a variety of learning rules, and I'm finding that AdaGrad currently doesn't seem to work on a couple of machines I've tried:

Traceback (most recent call last):
  File "pylearn2/training_algorithms/sgd.py", line 470, in train
    raise Exception("NaN in " + param.name)
Exception: NaN in Output_Linear_W

This is a linear layer, but the problem seems to happen regardless of layer type.

Am I missing anything?

TNick commented

Try using NanGuard to detect where it gets produced. Be aware of #1465.

OK, here's what I got:

Traceback (most recent call last):
  ...
  File "pylearn2/pylearn2/training_algorithms/sgd.py", line 455, in train
    self.sgd_update(*batch)
  File "/Library/Python/2.7/site-packages/theano/compile/function_module.py", line 595, in __call__
    outputs = self.fn()
  File "/Library/Python/2.7/site-packages/theano/gof/link.py", line 797, in f
    raise_with_op(node, *thunks)
  File "/Library/Python/2.7/site-packages/theano/gof/link.py", line 795, in f
    wrapper(i, node, *thunks)
  File "/Library/Python/2.7/site-packages/theano/gof/link.py", line 810, in wrapper
    f(*args)
  File "pylearn2/pylearn2/devtools/nan_guard.py", line 105, in nan_check
    do_check_on(x, node, fn, False)
  File "pylearn2/pylearn2/devtools/nan_guard.py", line 84, in do_check_on
    assert False
AssertionError:
Apply node that caused the error: Elemwise{true_div,no_inplace}(DimShuffle{x,x}.0, Elemwise{sqrt,no_inplace}.0)
Inputs types: [TensorType(float32, (True, True)), TensorType(float32, matrix)]
Inputs shapes: [(1, 1), (16, 4)]
Inputs strides: [(4, 4), (16, 4)]
Inputs values: [array([[ -9.99999997e-07]], dtype=float32), 'not shown']

Backtrace when the node is created:
  File "pylearn2/pylearn2/training_algorithms/learning_rule.py", line 367, in get_updates
    * grads[param])

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

So it looks like it's specifically the delta_x_t of AdaGrad that's causing the issue. Maybe T.sqrt() returns zero, or, less likely, new_sum_squared_grad is negative?
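The failing Apply node above is the true_div of the (scaled) learning rate by the square root of the accumulated squared gradients. A minimal NumPy reproduction of that arithmetic (plain arrays standing in for the Theano symbols, with the same shapes as in the error) shows how the NaN arises when the accumulator is exactly zero:

```python
import numpy as np

# Hypothetical NumPy stand-in for the failing Theano expression:
# with an all-zero gradient history, sqrt(sum of squared gradients)
# is 0, the division gives inf, and inf * (zero gradient) gives NaN.
learning_rate = np.float32(1e-6)
grad = np.zeros((16, 4), dtype=np.float32)        # gradient is exactly zero
sum_squared_grad = np.sum(np.square(grad))        # accumulator stays at 0

with np.errstate(divide='ignore', invalid='ignore'):
    delta_x_t = -learning_rate / np.sqrt(sum_squared_grad) * grad

print(np.isnan(delta_x_t).all())  # True: every entry of the update is NaN
```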

TNick commented

So, new_sum_squared_grad is zero, T.sqrt() therefore returns zero as well, and the division on line 366 produces inf, which multiplied by the zero gradient yields NaN. Adding an epsilon (1e-9) in either of those places hacks around the problem.

The case I'm testing is doing training on a test dataset that's created as follows:

a_in, a_out = numpy.zeros((8,16)), numpy.zeros((8,4))

If I change this to the following, it works fine (no NaN and no hacks required):

a_in, a_out = numpy.ones((8,16)), numpy.zeros((8,4))

There's something about entirely zero inputs to the layer that causes the NaN. I understand this is very unlikely to happen in a real dataset, but could it cause rare instabilities?
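Why all-zero inputs trigger the case can be seen from the gradient of a linear layer: assuming, for illustration, an output y = x @ W with a mean-squared-error loss, the weight gradient is x.T @ (x @ W - t) / n, which is exactly zero whenever the input batch x is all zeros, so AdaGrad's squared-gradient accumulator never leaves zero:

```python
import numpy as np

# Illustrative sketch (assumed linear layer + MSE loss, not the exact
# pylearn2 model): with an all-zero input batch, the weight gradient
# x.T @ (x @ W - t) / n is identically zero.
a_in, a_out = np.zeros((8, 16)), np.zeros((8, 4))
W = np.random.randn(16, 4)

residual = a_in @ W - a_out
grad_W = a_in.T @ residual / len(a_in)

print(np.all(grad_W == 0.0))  # True: the accumulator stays at zero
```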

TNick commented

I'm baffled by this line:
https://github.com/lisa-lab/pylearn2/blob/master/pylearn2/training_algorithms/learning_rule.py#L335
Anyway, for new_sum_squared_grad to be 0, both sum_square_grad and T.sqr(grads[param]) need to be 0. sum_square_grad seems to be set to 0 by that funky line, and grads[param] (the gradient for the parameter) may very well be 0.
Disclaimer: I'm just looking at learning_rule.py. I should be reading the paper: http://www.magicbroom.info/Papers/DuchiHaSi10.pdf.

Right, having a minimum value for the denominator would be a good idea for these corner cases. Something like T.maximum(eps, T.sqrt(...)), where eps could be an additional parameter of AdaGrad.

@TNick This line defines how the update rule should build an expression for the parameter updates, starting from the gradients, learning rates, and individual learning rate multipliers for each parameter.
It also returns updates for the persistent shared variables that track moving averages for the gradients.
That function is executed only once, to build the symbolic graph computing that expression, so sum_square_grad is initialized to 0, but will be updated each time the training function gets executed.

TNick commented

Thanks Pascal.
The param.get_value() * 0. inside sharedX didn't make sense to me. I understand now that the purpose is to create a shared variable of the same shape as param.get_value(), but initialized to 0.
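The idiom is easy to check with plain NumPy, which is what the shared variable wraps: multiplying the parameter array by 0. yields a zero array of the same shape and dtype, which then serves as the accumulator's initial value.

```python
import numpy as np

# NumPy analogue of the sharedX(param.get_value() * 0.) idiom:
# multiplying by 0. gives a zero array with the same shape and dtype
# as the parameter, suitable as the accumulator's initial state.
param_value = np.random.randn(16, 4).astype(np.float32)
init = param_value * 0.

print(init.shape, init.dtype)  # (16, 4) float32
print(np.all(init == 0.0))     # True
```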

You want me to make a pull request?

def get_updates(self, learning_rate, grads, lr_scalers=None, eps=1e-7):
    """
    Compute the AdaGrad updates
    Parameters
    ----------
    learning_rate : float
        Learning rate coefficient.
    grads : dict
        A dictionary mapping from the model's parameters to their
        gradients.
    lr_scalers : dict
        A dictionary mapping from the model's parameters to a learning
        rate multiplier.
    eps : float, optional
        Floor for the scaled-gradient denominator; prevents division
        by zero when the accumulated squared gradients are all zero.
    """
    updates = OrderedDict()
    for param in grads.keys():

        # sum_square_grad := \sum g^2
        sum_square_grad = sharedX(param.get_value() * 0.)

        if param.name is not None:
            sum_square_grad.name = 'sum_square_grad_' + param.name

        # Accumulate gradient
        new_sum_squared_grad = (
            sum_square_grad + T.sqr(grads[param])
        )

        # Compute update
        epsilon = lr_scalers.get(param, 1.) * learning_rate
        delta_x_t = (- epsilon /
                     T.maximum(eps, T.sqrt(new_sum_squared_grad))
                     * grads[param])

        # Apply update
        updates[sum_square_grad] = new_sum_squared_grad
        updates[param] = param + delta_x_t

    return updates


@TNick Thanks, a pull request along those lines would be great. Actually, rather than defining eps as an additional parameter, I think it would be better to use max_scaling as 1. / eps, as defined in RMSProp. That way, the interfaces of the different rules are more consistent with each other.
The PR should also include a unit test that checks that no NaN gets generated in the case reported by @alexjc.
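The shape of such a test can be sketched numerically: simulate one AdaGrad step with the eps floor on an all-zero gradient (the reported case) and assert no NaN appears. This is a plain-NumPy stand-in, not the actual pylearn2/Theano training loop the PR's test would exercise, and adagrad_step is a hypothetical helper mirroring the update above:

```python
import numpy as np

# Hedged sketch of the kind of check the PR's unit test could make,
# using a NumPy simulation of the fixed update rather than the real
# Theano graph. adagrad_step is a hypothetical helper for illustration.
def adagrad_step(param, grad, sum_sq, learning_rate=0.05, eps=1e-7):
    new_sum_sq = sum_sq + np.square(grad)
    delta = -learning_rate / np.maximum(eps, np.sqrt(new_sum_sq)) * grad
    return param + delta, new_sum_sq

param = np.zeros((16, 4), dtype=np.float32)
grad = np.zeros_like(param)  # zero gradient, as with all-zero inputs
new_param, new_sum_sq = adagrad_step(param, grad, np.zeros_like(param))

assert not np.isnan(new_param).any()  # the eps floor prevents the NaN
print("no NaN produced")
```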

Done in #1521.