Expectigrad is a first-order stochastic optimization method that fixes the known divergence issue of Adam, RMSProp, and related adaptive methods while offering better performance on well-known deep learning benchmarks.
Expectigrad introduces two innovations to adaptive gradient methods:
- Arithmetic RMS: Computes the true root-mean-square of all past gradients (an arithmetic mean of their squares) instead of an exponential moving average (EMA). This makes Expectigrad more robust to divergence and, in theory, less susceptible to gradient noise.
- Outer momentum: Applies momentum after adapting the step sizes, not before. This reduces bias in the updates by preserving the superposition property.
See the paper for more details.
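For reference, here is a rough sketch of the update rule, paraphrased from the paper (see arXiv:2010.01356 for the precise statement). All operations are elementwise; `g_t` is the stochastic gradient, and the symbols α, β, ε, and the counter correspond to the `learning_rate`, `beta`, `epsilon`, and `sparse_counter` arguments documented below:

```latex
% Our paraphrase of the Expectigrad update; consult the paper for exact details.
\begin{aligned}
s_t &= s_{t-1} + g_t^2
  && \text{running sum of squared gradients (arithmetic, not EMA)} \\
n_t &= n_{t-1} + \mathbb{1}[g_t \neq 0]
  && \text{counter, incremented only where the gradient is nonzero} \\
m_t &= \beta\, m_{t-1} + (1 - \beta)\, \frac{g_t}{\epsilon + \sqrt{s_t / n_t}}
  && \text{momentum applied \emph{after} adapting step sizes} \\
x_{t+1} &= x_t - \alpha\, \frac{m_t}{1 - \beta^t}
  && \text{bias-corrected parameter update}
\end{aligned}
```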
PyTorch, TensorFlow 1.x, and TensorFlow 2.x are all supported. See installation and usage below to get started.
If you use this code for published work, please cite the original paper:
```bibtex
@article{daley2020expectigrad,
  title={Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties},
  author={Daley, Brett and Amato, Christopher},
  journal={arXiv preprint arXiv:2010.01356},
  year={2020}
}
```
Use pip to quickly install Expectigrad:
```sh
pip install expectigrad
```
Or you can clone this repository and install manually:
```sh
git clone https://github.com/brett-daley/expectigrad.git
cd expectigrad
pip install -e .
```
PyTorch and both versions of TensorFlow are supported. Refer to the code snippets below to instantiate the optimizer for your deep learning framework.
```python
import expectigrad

expectigrad.pytorch.Expectigrad(
    params, lr=0.001, beta=0.9, eps=1e-8, sparse_counter=True
)
```
| Args | | |
|---|---|---|
| `params` | (`iterable`) | Iterable of parameters to optimize or dicts defining parameter groups. |
| `lr` | (`float`) | The learning rate, a scale factor applied to each optimizer step. Default: `0.001` |
| `beta` | (`float`) | The decay rate for Expectigrad's bias-corrected, "outer" momentum. Must be in the interval [0, 1). Default: `0.9` |
| `eps` | (`float`) | A small constant added to the denominator for numerical stability. Must be greater than 0. Default: `1e-8` |
| `sparse_counter` | (`bool`) | If `True`, Expectigrad's counter increments only where the gradient is nonzero. If `False`, the counter increments unconditionally. Default: `True` |
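Assuming the class follows the standard `torch.optim.Optimizer` interface, as its constructor suggests, it drops into an ordinary training loop like any other PyTorch optimizer. A minimal sketch with a placeholder model and random data:

```python
import torch
import expectigrad

# Minimal sketch: fit a toy linear regression with Expectigrad.
model = torch.nn.Linear(10, 1)
optimizer = expectigrad.pytorch.Expectigrad(model.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()

inputs = torch.randn(32, 10)   # placeholder batch
targets = torch.randn(32, 1)   # placeholder labels

for step in range(100):
    optimizer.zero_grad()                 # standard torch.optim interface
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```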
```python
import expectigrad

expectigrad.tensorflow1.ExpectigradOptimizer(
    learning_rate=0.001, beta=0.9, epsilon=1e-8, sparse_counter=True,
    use_locking=False, name='Expectigrad'
)
```
| Args | | |
|---|---|---|
| `learning_rate` | | The learning rate, a scale factor applied to each optimizer step. Can be a `float`, `tf.keras.optimizers.schedules.LearningRateSchedule`, `Tensor`, or callable that takes no arguments and returns the value to use. Default: `0.001` |
| `beta` | (`float`) | The decay rate for Expectigrad's bias-corrected, "outer" momentum. Must be in the interval [0, 1). Default: `0.9` |
| `epsilon` | (`float`) | A small constant added to the denominator for numerical stability. Must be greater than 0. Default: `1e-8` |
| `sparse_counter` | (`bool`) | If `True`, Expectigrad's counter increments only where the gradient is nonzero. If `False`, the counter increments unconditionally. Default: `True` |
| `use_locking` | (`bool`) | If `True`, use locks to prevent concurrent updates to variables. Default: `False` |
| `name` | (`str`) | Optional name for the operations created when applying gradients. Default: `'Expectigrad'` |
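Under TensorFlow 1.x, the optimizer should plug into the usual graph-and-session workflow, assuming it implements the standard `tf.train.Optimizer` interface. A minimal sketch minimizing `x**2`:

```python
import tensorflow as tf  # TensorFlow 1.x
import expectigrad

# Minimal sketch: minimize f(x) = x^2 with Expectigrad.
x = tf.Variable(5.0)
loss = tf.square(x)

optimizer = expectigrad.tensorflow1.ExpectigradOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    print(sess.run(x))  # should be close to 0
```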
```python
import expectigrad

expectigrad.tensorflow2.Expectigrad(
    learning_rate=0.001, beta=0.9, epsilon=1e-8, sparse_counter=True,
    name='Expectigrad', **kwargs
)
```
| Args | | |
|---|---|---|
| `learning_rate` | | The learning rate, a scale factor applied to each optimizer step. Can be a `float`, `tf.keras.optimizers.schedules.LearningRateSchedule`, `Tensor`, or callable that takes no arguments and returns the value to use. Default: `0.001` |
| `beta` | (`float`) | The decay rate for Expectigrad's bias-corrected, "outer" momentum. Must be in the interval [0, 1). Default: `0.9` |
| `epsilon` | (`float`) | A small constant added to the denominator for numerical stability. Must be greater than 0. Default: `1e-8` |
| `sparse_counter` | (`bool`) | If `True`, Expectigrad's counter increments only where the gradient is nonzero. If `False`, the counter increments unconditionally. Default: `True` |
| `name` | (`str`) | Optional name for the operations created when applying gradients. Default: `'Expectigrad'` |
| `**kwargs` | | Keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is gradient clipping by norm; `clipvalue` is gradient clipping by value; `decay` is included for backward compatibility to allow time inverse decay of the learning rate; `lr` is included for backward compatibility, but `learning_rate` is recommended instead. |
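Assuming the class behaves as a standard `tf.keras` optimizer, it can be passed directly to `model.compile`. A minimal sketch with a placeholder model and random data:

```python
import numpy as np
import tensorflow as tf
import expectigrad

# Minimal sketch: train a tiny Keras model with Expectigrad.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(
    optimizer=expectigrad.tensorflow2.Expectigrad(learning_rate=0.001),
    loss='mse',
)

inputs = np.random.randn(64, 10).astype(np.float32)   # placeholder features
targets = np.random.randn(64, 1).astype(np.float32)   # placeholder labels
model.fit(inputs, targets, epochs=3, verbose=0)
```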