Implementation of the AdamW optimizer (Ilya Loshchilov, Frank Hutter) for Keras.
- python 3.6
- Keras 2.1.6
- tensorflow(-gpu) 1.8.0
In addition to the usual Keras setup for building neural networks (see Keras for details):
from AdamW import AdamW
adamw = AdamW(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0., weight_decay=0.025, batch_size=1, samples_per_epoch=1, epochs=1)
After that, nothing changes compared to the usual use of an optimizer in Keras once the model's architecture has been defined:
from keras.models import Sequential
from keras import metrics

model = Sequential()
<definition of the model_architecture>
model.compile(loss="mse", optimizer=adamw, metrics=[metrics.mse], ...)
Note that the batch size (batch_size), the number of training samples per epoch (samples_per_epoch), and the number of epochs (epochs) are required to normalize the weight decay (paper, Section 4).
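To make the role of these three arguments concrete, here is a minimal sketch of the normalization described in the paper, where the effective weight decay is the normalized value scaled by sqrt(b / (B * T)), with b the batch size, B the number of training samples per epoch, and T the number of epochs. The function name `normalized_weight_decay` is hypothetical, and it is an assumption that this implementation applies the formula in exactly this form.

```python
import math


def normalized_weight_decay(weight_decay_norm, batch_size, samples_per_epoch, epochs):
    """Scale a normalized weight decay by sqrt(b / (B * T)).

    Hypothetical helper for illustration only; it is not part of AdamW.py.
    """
    return weight_decay_norm * math.sqrt(batch_size / (samples_per_epoch * epochs))


# With the default values batch_size=1, samples_per_epoch=1, epochs=1,
# the scaling factor is 1 and the weight decay is left unchanged.
unchanged = normalized_weight_decay(0.025, batch_size=1, samples_per_epoch=1, epochs=1)

# Illustrative (made-up) training setup: a longer run on a larger dataset
# yields a smaller effective weight decay per update.
scaled = normalized_weight_decay(0.025, batch_size=128, samples_per_epoch=50000, epochs=100)
```

Passing the real values for the training run therefore matters: leaving the three arguments at 1 disables the normalization.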
- Weight decay added to the parameter optimization
- Normalized weight decay added
- Cosine annealing
- Warm restarts
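The first item in the list above, weight decay applied directly in the parameter update, can be illustrated with a single update step. This is a NumPy sketch of the decoupled update as described in the paper, not the Keras code of this repository: the function `adamw_step`, the schedule multiplier `eta`, and the value `epsilon=1e-8` are assumptions made for the example.

```python
import numpy as np


def adamw_step(theta, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999,
               epsilon=1e-8, weight_decay=0.025, eta=1.0):
    """One Adam step with decoupled weight decay (illustrative sketch)."""
    # Standard Adam moment estimates with bias correction.
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2
    m_hat = m / (1.0 - beta_1 ** t)
    v_hat = v / (1.0 - beta_2 ** t)
    # The decay term acts on the weights themselves and is not mixed into
    # the gradient, so it is not rescaled by the adaptive denominator.
    theta = theta - eta * (lr * m_hat / (np.sqrt(v_hat) + epsilon)
                           + weight_decay * theta)
    return theta, m, v


# With a zero gradient only the decay term acts on the weights.
theta0 = np.array([1.0, -2.0])
theta1, m1, v1 = adamw_step(theta0, np.zeros_like(theta0),
                            np.zeros_like(theta0), np.zeros_like(theta0), t=1)
```

In this sketch a zero gradient shrinks each weight by the factor (1 - weight_decay), which is the behaviour that distinguishes decoupled weight decay from L2 regularization added to the loss.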
Adam: A Method for Stochastic Optimization, D.P. Kingma, J. Lei Ba
Fixing Weight Decay Regularization in Adam, I. Loshchilov, F. Hutter