A second-order optimizer for TensorFlow that uses the Saddle-Free method of Dauphin et al. (2014) with some modifications.
The algorithm is described by Dauphin et al. (2014). The implementation here follows the paper with the following exceptions:
- The order of operations in the Lanczos method follows that recommended by Paige (1972) (see the sketch after this list).
- The damping applied to the curvature matrix in the Krylov subspace has three options, selectable in the optimizer's constructor.
- Instead of applying multiple damping coefficients and keeping the result with the lowest loss, this implementation uses a Marquardt-style heuristic to update the damping coefficient, as in Martens (2010) (see the second sketch after this list).
- If you choose a Krylov dimension larger than the number of parameters in the model, the algorithm skips the Lanczos method entirely and essentially becomes a Levenberg-Marquardt method with multiple damping options and a custom loss function. Obviously, this is only practical for very small models such as the one in XOR_Test.ipynb.
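For reference, here is a minimal NumPy sketch of the Lanczos recurrence in the order Paige (1972) recommends: the β-scaled previous basis vector is subtracted from A·v *before* α is computed. This is illustrative only; the names `matvec` and `krylov_dim` are assumptions for the example, not the optimizer's actual API. In this project, `matvec` would be a Hessian-vector product.

```python
import numpy as np

def lanczos(matvec, n, krylov_dim, seed=0):
    """Tridiagonalize an implicit symmetric n x n matrix given only a
    matrix-vector product. Sketch only; no re-orthogonalization,
    matching the behavior described in this README."""
    rng = np.random.RandomState(seed)
    v = rng.randn(n)
    v /= np.linalg.norm(v)
    v_prev = np.zeros(n)
    alphas, betas, basis = [], [], [v]
    beta = 0.0
    for _ in range(krylov_dim):
        # Paige's ordering: remove the beta * v_prev component first,
        # then compute alpha against the partially orthogonalized vector.
        w = matvec(v) - beta * v_prev
        alpha = np.dot(w, v)
        w -= alpha * v
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        if beta == 0.0:
            break  # exact invariant subspace found
        v_prev, v = v, w / beta
        basis.append(v)
    # T is the Krylov-subspace curvature matrix; Q holds the Lanczos vectors.
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    Q = np.stack(basis[:len(alphas)], axis=1)
    return T, Q
```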
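And a sketch of the Marquardt-style heuristic from Martens (2010): after each step, the actual loss reduction is compared with the reduction predicted by the quadratic model, and the damping coefficient λ is adjusted accordingly. The thresholds 1/4 and 3/4 and the factors 3/2 and 2/3 follow Martens (2010); the function and argument names below are illustrative, not this optimizer's actual interface.

```python
def update_damping(lam, actual_reduction, predicted_reduction,
                   increase=3.0 / 2.0, decrease=2.0 / 3.0):
    """Levenberg-Marquardt style damping update (Martens, 2010).

    rho near 1 means the quadratic model is trustworthy, so damping
    can be relaxed; rho well below 1 means the model overestimated
    progress, so damping is tightened."""
    rho = actual_reduction / predicted_reduction
    if rho < 0.25:
        lam *= increase
    elif rho > 0.75:
        lam *= decrease
    return lam
```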
- SFOptimizer.py is the optimizer class.
- mnist/dataset.py is a utility module from https://github.com/tensorflow/models.git used to obtain the MNIST data.
- XOR_Test.ipynb is a Jupyter notebook containing a simple network trained on the XOR function.
- AE_Test.ipynb is a Jupyter notebook containing a deep autoencoder network trained on MNIST data.
- The Lanczos iteration loop is unrolled into branches in the TensorFlow graph, which allows a full optimization step to be taken in a single TF operation. However, the graph can become large if you use a high Krylov dimension (see the sketch after this list).
- As in the original paper, no re-orthogonalization is applied to the Lanczos vectors, so they will likely lose orthogonality and become linearly dependent if the Krylov dimension is high (> 100?); there is thus little benefit in pushing the Krylov dimension that far.
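To illustrate the unrolling, a minimal TF 1.x graph-mode sketch: each pass through a Python-level for loop adds a fresh copy of that step's ops to the graph, so a single session.run() executes all K steps. Plain power iteration is shown for brevity; the optimizer unrolls the Lanczos recurrence the same way. K and the placeholder shape are assumptions for the example.

```python
import numpy as np
import tensorflow as tf

K = 5  # "Krylov dimension" of the example; the graph grows linearly with K

A = tf.placeholder(tf.float32, shape=[10, 10])
v = tf.ones([10, 1]) / tf.sqrt(10.0)

# Unrolled iteration: this Python loop runs at graph-construction time,
# appending K matmul/normalize branches to the graph.
for _ in range(K):
    w = tf.matmul(A, v)
    v = w / tf.norm(w)

with tf.Session() as sess:
    a = np.random.randn(10, 10).astype(np.float32)
    top_vec = sess.run(v, feed_dict={A: a + a.T})  # one run executes all K steps
```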
- Tested with Python 3.6.7 and TensorFlow 1.12.0