Expectigrad is a first-order stochastic optimization method that fixes the known divergence issue of Adam, RMSProp, and related adaptive methods while offering better performance on well-known deep learning benchmarks.
Expectigrad introduces two innovations to adaptive gradient methods:
- Arithmetic RMS: Computes the true root-mean-square of all past gradients (an arithmetic mean of their squares) instead of an exponential moving average (EMA). This makes Expectigrad more robust to divergence and, in theory, less susceptible to gradient noise.
- Outer momentum: Applies momentum after adapting the step sizes, not before. This reduces bias in the updates by preserving the superposition property.
See the paper for more details.
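For reference, here is a rough sketch of the update rule, paraphrased from the paper (see arXiv:2010.01356 for the precise statement). All operations are elementwise; `g_t` is the stochastic gradient, and the symbols α, β, ε, and the counter correspond to the `learning_rate`, `beta`, `epsilon`, and `sparse_counter` arguments documented below:

```latex
% Our paraphrase of the Expectigrad update; consult the paper for exact details.
\begin{aligned}
s_t &= s_{t-1} + g_t^2
  && \text{running sum of squared gradients (arithmetic, not EMA)} \\
n_t &= n_{t-1} + \mathbb{1}[g_t \neq 0]
  && \text{counter, incremented only where the gradient is nonzero} \\
m_t &= \beta\, m_{t-1} + (1 - \beta)\, \frac{g_t}{\epsilon + \sqrt{s_t / n_t}}
  && \text{momentum applied \emph{after} adapting step sizes} \\
x_{t+1} &= x_t - \alpha\, \frac{m_t}{1 - \beta^t}
  && \text{bias-corrected parameter update}
\end{aligned}
```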
PyTorch, TensorFlow 1.x, and TensorFlow 2.x are all supported. See installation and usage below to get started.
If you use this code for published work, please cite the original paper:
```bibtex
@article{daley2020expectigrad,
  title={Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties},
  author={Daley, Brett and Amato, Christopher},
  journal={arXiv preprint arXiv:2010.01356},
  year={2020}
}
```
Use pip to quickly install Expectigrad:
```sh
pip install expectigrad
```
Or you can clone this repository and install manually:
```sh
git clone https://github.com/brett-daley/expectigrad.git
cd expectigrad
pip install -e .
```
PyTorch and both versions of TensorFlow are supported. Refer to the code snippets below to instantiate the optimizer for your deep learning framework.
```python
import expectigrad

expectigrad.pytorch.Expectigrad(
    params, lr=0.001, beta=0.9, eps=1e-8, sparse_counter=True
)
```
| Args | | |
|---|---|---|
| `params` | (`iterable`) | Iterable of parameters to optimize or dicts defining parameter groups. |
| `lr` | (`float`) | The learning rate, a scale factor applied to each optimizer step. Default: `0.001` |
| `beta` | (`float`) | The decay rate for Expectigrad's bias-corrected, "outer" momentum. Must be in the interval [0, 1). Default: `0.9` |
| `eps` | (`float`) | A small constant added to the denominator for numerical stability. Must be greater than 0. Default: `1e-8` |
| `sparse_counter` | (`bool`) | If `True`, Expectigrad's counter increments only where the gradient is nonzero. If `False`, the counter increments unconditionally. Default: `True` |
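Assuming the class follows the standard `torch.optim.Optimizer` interface, as its constructor suggests, it drops into an ordinary training loop like any other PyTorch optimizer. A minimal sketch with a placeholder model and random data:

```python
import torch
import expectigrad

# Minimal sketch: fit a toy linear regression with Expectigrad.
model = torch.nn.Linear(10, 1)
optimizer = expectigrad.pytorch.Expectigrad(model.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()

inputs = torch.randn(32, 10)   # placeholder batch
targets = torch.randn(32, 1)   # placeholder labels

for step in range(100):
    optimizer.zero_grad()                 # standard torch.optim interface
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```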
```python
import expectigrad

expectigrad.tensorflow1.ExpectigradOptimizer(
    learning_rate=0.001, beta=0.9, epsilon=1e-8, sparse_counter=True,
    use_locking=False, name='Expectigrad'
)
```
| Args | | |
|---|---|---|
| `learning_rate` | | The learning rate, a scale factor applied to each optimizer step. Can be a `float`, `tf.keras.optimizers.schedules.LearningRateSchedule`, `Tensor`, or callable that takes no arguments and returns the value to use. Default: `0.001` |
| `beta` | (`float`) | The decay rate for Expectigrad's bias-corrected, "outer" momentum. Must be in the interval [0, 1). Default: `0.9` |
| `epsilon` | (`float`) | A small constant added to the denominator for numerical stability. Must be greater than 0. Default: `1e-8` |
| `sparse_counter` | (`bool`) | If `True`, Expectigrad's counter increments only where the gradient is nonzero. If `False`, the counter increments unconditionally. Default: `True` |
| `use_locking` | (`bool`) | If `True`, use locks to prevent concurrent updates to variables. Default: `False` |
| `name` | (`str`) | Optional name for the operations created when applying gradients. Default: `'Expectigrad'` |
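Under TensorFlow 1.x, the optimizer should plug into the usual graph-and-session workflow, assuming it implements the standard `tf.train.Optimizer` interface. A minimal sketch minimizing `x**2`:

```python
import tensorflow as tf  # TensorFlow 1.x
import expectigrad

# Minimal sketch: minimize f(x) = x^2 with Expectigrad.
x = tf.Variable(5.0)
loss = tf.square(x)

optimizer = expectigrad.tensorflow1.ExpectigradOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    print(sess.run(x))  # should be close to 0
```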
```python
import expectigrad

expectigrad.tensorflow2.Expectigrad(
    learning_rate=0.001, beta=0.9, epsilon=1e-8, sparse_counter=True,
    name='Expectigrad', **kwargs
)
```
| Args | | |
|---|---|---|
| `learning_rate` | | The learning rate, a scale factor applied to each optimizer step. Can be a `float`, `tf.keras.optimizers.schedules.LearningRateSchedule`, `Tensor`, or callable that takes no arguments and returns the value to use. Default: `0.001` |
| `beta` | (`float`) | The decay rate for Expectigrad's bias-corrected, "outer" momentum. Must be in the interval [0, 1). Default: `0.9` |
| `epsilon` | (`float`) | A small constant added to the denominator for numerical stability. Must be greater than 0. Default: `1e-8` |
| `sparse_counter` | (`bool`) | If `True`, Expectigrad's counter increments only where the gradient is nonzero. If `False`, the counter increments unconditionally. Default: `True` |
| `name` | (`str`) | Optional name for the operations created when applying gradients. Default: `'Expectigrad'` |
| `**kwargs` | | Keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is gradient clipping by norm; `clipvalue` is gradient clipping by value; `decay` is included for backward compatibility to allow time inverse decay of the learning rate; `lr` is included for backward compatibility, but `learning_rate` is recommended instead. |
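Assuming the class behaves as a standard `tf.keras` optimizer, it can be passed directly to `model.compile`. A minimal sketch with a placeholder model and random data:

```python
import numpy as np
import tensorflow as tf
import expectigrad

# Minimal sketch: train a tiny Keras model with Expectigrad.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(
    optimizer=expectigrad.tensorflow2.Expectigrad(learning_rate=0.001),
    loss='mse',
)

inputs = np.random.randn(64, 10).astype(np.float32)   # placeholder features
targets = np.random.randn(64, 1).astype(np.float32)   # placeholder labels
model.fit(inputs, targets, epochs=3, verbose=0)
```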