This repository presents an implementation of one of the Hamiltonian Descent Methods presented in
Maddison, C. J., Paulin, D., Teh, Y. W., O'Donoghue, B., & Doucet, A. (2018). Hamiltonian Descent Methods. arXiv:1809.05042
Creative as we (Jannis Zeller, David Schischke) are, we named the implemented optimizer HaGraD.
Many commonly used optimization methods in Machine Learning (ML) show linear to sublinear convergence rates. Hamiltonian Descent Methods (HD) are a family of methods that achieve linear convergence for a broader class of convex functions (Maddison et al., 2018). We provide implementations of a specific HD method in TensorFlow (tf.keras) and PyTorch. In the notebooks we compare the method's performance to stochastic gradient descent (SGD) and adaptive moment estimation (Adam) on standard visual tasks and on one Natural Language Processing (NLP) task. This is just a side project with all scripts and notebooks executed locally. They are not completely unified and should be viewed as a first glimpse.
This section provides a brief summary of the HD methods proposed by Maddison et al. (2018), adding some context and comparisons to commonly used optimization methods for Neural Networks.
Refer to any textbook on theoretical mechanics for the following. Like other "Hamiltonian-labelled" approaches (e.g. Hamiltonian Monte Carlo sampling), HD methods are rooted in physics. In classical mechanics a system and its propagation are fully specified by its space coordinates $q$ and the corresponding momenta $p$; the Hamiltonian $H(q, p)$, typically the total energy of the system, generates the equations of motion.
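For reference, the time evolution is given by Hamilton's equations; for a separable Hamiltonian $H(q, p) = T(p) + f(q)$ (a kinetic energy plus a potential, the latter playing the role of the objective in HD) they read:

$$\dot q = \nabla_p H(q, p) = \nabla_p T(p), \qquad \dot p = -\nabla_q H(q, p) = -\nabla_q f(q).$$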
Optimization methods seek optima (we consider just minima in the following) of some objective function. They typically do so by iteratively updating the variables based on local information about the objective, most commonly its gradient.
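A minimal example of such an iterative update is plain gradient descent on a one-dimensional toy objective (the function and step size below are arbitrary illustrative choices):

```python
# Gradient descent x_{k+1} = x_k - eta * f'(x_k) on f(x) = (x - 3)^2,
# whose unique minimizer is x = 3.
grad_f = lambda x: 2 * (x - 3.0)

x, eta = 0.0, 0.1
for _ in range(100):
    x -= eta * grad_f(x)  # step against the gradient

# x is now very close to the minimizer 3.0
```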
Optimization methods typically place several (but not always identical) requirements on the objective function, some of which are relevant for understanding HD. For more concise definitions we refer to Bierlaire (2015), a well-written and (comparably) accessible introductory textbook on optimization that is freely available.
- Convexity of the Objective Function: Convex functions are a family of functions with desirable properties for optimization, chiefly that any local optimum is certain to be a global optimum as well. In brief, a function is convex if the line segment connecting any two points on its graph lies on or above the graph. All optimization methods in this notebook are designed to work on convex functions.
- Constrainedness of the Objective Function: An optimization problem is constrained if the objective function is restricted by equality requirements (e.g. $\min f(x) \text{ s.t. } c(x) = 0$) or by inequality requirements (e.g. $\min f(x) \text{ s.t. } c(x) \leq 0$). All optimization methods presented in this notebook perform unconstrained optimization.
- Differentiability of the Objective Function: Optimization methods require different degrees of differentiability of the objective function. All optimization methods presented in this notebook require information about the gradient of the objective function, which means that the objective function must be differentiable once (i.e. $f \in C^1$).
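The convexity requirement above can be checked numerically on a toy example (the function $f(x) = x^2$ and the sample points are our illustrative choices):

```python
import numpy as np

# Convexity of f(x) = x^2: the chord between two points on the graph lies
# on or above the graph, i.e.
#   f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)   for all t in [0, 1].
f = lambda x: x**2
x1, x2 = -1.0, 3.0

ts = np.linspace(0.0, 1.0, 11)
chords = ts * f(x1) + (1 - ts) * f(x2)   # values on the connecting line
curve = f(ts * x1 + (1 - ts) * x2)       # function values at the same points

is_convex_here = bool(np.all(curve <= chords + 1e-12))
```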
In addition to the differences in requirements on the objective function, optimization algorithms differ in their efficiency in terms of iterations, i.e. in how fast the iterates $x^k$ approach the optimum $\hat x$. Common convergence rates are:
- Quadratic: $\lVert x^{k+1} - \hat x\rVert \leq C\lVert x^k - \hat x\rVert^2$ for some $C < \infty$
- Superlinear: $\lVert x^{k+1} - \hat x \rVert \leq \alpha_k \lVert x^k - \hat x\rVert$ for $\alpha_k \downarrow 0$
- Linear: $\lVert x^{k+1} - \hat x\rVert \leq \alpha \lVert x^k - \hat x \rVert$ for some $\alpha < 1$
- Sublinear: $\lVert x^k - \hat x\rVert \leq \frac{C}{k^\beta}$, often $\beta \in \{2, 1, \frac{1}{2}\}$
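The practical difference between these rates is easy to see on toy error sequences (the factors $\alpha = 0.5$ and $C = 1$ below are arbitrary illustrative choices):

```python
# Toy error sequences illustrating the rate definitions: a linearly
# convergent sequence contracts by a constant factor each step, while a
# quadratically convergent one roughly squares the error each step.
def errors(e0, step, k):
    errs = [e0]
    for _ in range(k):
        errs.append(step(errs[-1]))
    return errs

linear = errors(0.5, lambda e: 0.5 * e, 10)   # e_{k+1} = alpha * e_k, alpha = 0.5
quadratic = errors(0.5, lambda e: e**2, 5)    # e_{k+1} = C * e_k^2,   C = 1
```

After only 5 quadratic steps the error is already far below what 10 linear steps achieve.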
Our implementation of HD follows the first explicit algorithm from Maddison et al. (2018, p. 17). I.e., given the gradient of the loss and the gradient of a kinetic energy, each iteration first takes a damped momentum step along the negative loss gradient and then a parameter step along the kinetic-energy gradient.
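The following numpy sketch shows such an update loop on a toy quadratic objective. The damping factor `delta = 1 / (1 + eps * gamma)`, the parameter names, and the exact ordering reflect our reading of the paper's first explicit method; treat it as illustrative rather than the reference implementation:

```python
import numpy as np

# Hamiltonian-descent-style updates on f(x) = ||x||^2 / 2 with the
# relativistic kinetic energy. eps (step size) and gamma (damping) are
# illustrative choices, not tuned values.
def grad_f(x):
    return x  # gradient of f(x) = ||x||^2 / 2

def grad_T(p):
    return p / np.sqrt(np.sum(p**2) + 1.0)  # relativistic kinetic-energy gradient

eps, gamma = 0.1, 1.0
delta = 1.0 / (1.0 + eps * gamma)  # momentum damping factor

x = np.array([2.0, -1.5])
p = np.zeros_like(x)
for _ in range(200):
    p = delta * (p - eps * grad_f(x))  # momentum step (uses loss gradient)
    x = x + eps * grad_T(p)            # position step (uses kinetic-energy gradient)

# x has converged close to the minimizer at the origin
```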
We set up HaGraD using TensorFlow version 2.6. This is just a little side project, so we apologize for not having had the time to set up a proper Python package for easy import and direct usage with your code until now. Feel free to simply use the files from ./src and alter them according to your needs. They can then easily be imported into other Python scripts or notebooks using Python's import logic. A HaGraD optimizer consists of the base class Hagrad, which can be imported from its source file hagrad.py, and a kinetic energy function's gradient. The kinetic energy is not fixed per se, but we provide three standard choices that are also mentioned in Maddison et al. (2018), namely:
- Classical Kinetic Energy:
  $$T = \frac{\lVert p\rVert^2}{2} \qquad \Rightarrow \qquad \nabla_p T = p.$$
- Relativistic Kinetic Energy:
  $$T = \sqrt{\lVert p\rVert^2+1}-1 \qquad \Rightarrow \qquad \nabla_p T = \frac{p}{\sqrt{\lVert p\rVert^2 + 1}}.$$
- "Power" Kinetic Energy:
  $$T = \frac{1}{A}\big(\lVert p\rVert^a+1\big)^{A/a}-\frac{1}{A} \qquad \Rightarrow \qquad \nabla_p T = p \, \lVert p\rVert^{a-2} \big(\lVert p\rVert^{a} + 1\big)^{A/a-1}.$$
We mainly used the first two of the three and set the default to the relativistic kinetic energy. The class KineticEnergyGradients in kinetic_energy_gradients.py provides these three as static methods.
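A numpy sketch of what such a static-method helper could look like is shown below; the actual class lives in kinetic_energy_gradients.py, so the class name and bodies here are our illustrative reconstruction, not the shipped code:

```python
import numpy as np

# Sketch of a KineticEnergyGradients-style helper: each static method
# returns the gradient function of the corresponding kinetic energy.
class KineticEnergyGradientsSketch:
    @staticmethod
    def classical():
        # T = ||p||^2 / 2  =>  grad T = p
        return lambda p: p

    @staticmethod
    def relativistic():
        # T = sqrt(||p||^2 + 1) - 1  =>  grad T = p / sqrt(||p||^2 + 1)
        return lambda p: p / np.sqrt(np.sum(p**2) + 1.0)

    @staticmethod
    def power(a=2.0, A=2.0):
        # T = (1/A)(||p||^a + 1)^{A/a} - 1/A
        #   =>  grad T = p * ||p||^{a-2} * (||p||^a + 1)^{A/a - 1}
        def grad(p):
            n = np.sqrt(np.sum(p**2))
            return p * n**(a - 2.0) * (n**a + 1.0)**(A / a - 1.0)
        return grad
```

Note that for $a = A = 2$ the power kinetic energy gradient reduces to the classical one.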
HaGraD can then be used as follows:

```python
import tensorflow.keras as keras
import numpy as np

from hagrad import Hagrad
from kinetic_energy_gradients import KineticEnergyGradients

## Define Optimizer
hagrad = Hagrad(
    p0_mean=0.001,
    kinetic_energy_gradient=KineticEnergyGradients.relativistic())

## Generate Data (checkerboard)
X = 2 * (np.random.rand(1000, 2) - 0.5)
y = np.array(X[:, 0] * X[:, 1] > 0, np.int32)

## Define Model
model = keras.models.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])

## Compile and Fit Model
model.compile(
    loss=keras.losses.binary_crossentropy,
    optimizer=hagrad,
    metrics=["accuracy"])

model.fit(X, y, epochs=10, batch_size=32)
```
This is essentially the main function from hagrad.py, which can be run from the terminal using python -m src.hagrad (or similar, depending on how the dependencies are managed).
Please refer to the test case in the torch_hagrad.py source file. In PyTorch the setup is a little longer, and we do not want to extend this README even further. In short, you should be able to use it just like the built-in optimizers (as in TensorFlow). Note that for the PyTorch version we decided not to provide an extra class with kinetic energy gradients; simply choose between "relativistic" and "classical" at initialization.