A wrapper for `scipy.optimize.minimize` that makes it a PyTorch Optimizer, providing Conjugate Gradients, BFGS, L-BFGS, SLSQP, Newton Conjugate Gradient, Trust Region methods and others in PyTorch.
Warning: this project is a proof of concept and is not necessarily reliable, although the code (and that's all of it) is small enough to be readable.
Dependencies:

- pytorch
- scipy

The following install procedure isn't going to check these are installed.
This package can be installed with pip, directly from GitHub:

```
python -m pip install git+https://github.com/gngdb/pytorch-minimize.git
```

Or by cloning the repository and then installing:

```
git clone https://github.com/gngdb/pytorch-minimize.git
cd pytorch-minimize
python -m pip install .
```
The Optimizer class is `MinimizeWrapper` in `pytorch_minimize.optim`. It has the same interface as a PyTorch Optimizer, taking `model.parameters()`, and is configured by passing a dictionary of arguments, here called `minimizer_args`, that will later be passed to `scipy.optimize.minimize`:

```python
from pytorch_minimize.optim import MinimizeWrapper
minimizer_args = dict(method='CG', options={'disp': True, 'maxiter': 100})
optimizer = MinimizeWrapper(model.parameters(), minimizer_args)
```
The main difference from most PyTorch optimizers is that a closure must be defined (`torch.optim.LBFGS` also requires this):

```python
def closure():
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    return loss

optimizer.step(closure)
```
This optimizer is intended for deterministic optimisation problems, such as full-batch learning problems, so `optimizer.step(closure)` should only need to be called once.

Can `.step(closure)` be called more than once? Technically yes, but it shouldn't be necessary, because multiple steps are run internally up to the `maxiter` option in `minimizer_args`, and multiple calls are not recommended: each call to `optimizer.step(closure)` is an independent evaluation of `scipy.optimize.minimize`, so the internal state of the optimization algorithm is discarded between calls.
Using PyTorch to calculate the Jacobian, the following algorithms are supported (the method name string on the right corresponds to the names used by `scipy.optimize.minimize`):

- Conjugate Gradients: `'CG'`
- Broyden-Fletcher-Goldfarb-Shanno (BFGS): `'BFGS'`
- Limited-memory BFGS: `'L-BFGS-B'`
- Sequential Least Squares Programming: `'SLSQP'`
Warning: this is experimental and probably unpredictable.

To use the methods that require evaluating the Hessian, a `Closure` object with the following methods is required (full MNIST example here):

```python
class Closure():
    def __init__(self, model):
        self.model = model

    @staticmethod
    def loss(model):
        output = model(data)
        return loss_fn(output, target)

    def __call__(self):
        optimizer.zero_grad()
        loss = self.loss(self.model)
        loss.backward()
        return loss

closure = Closure(model)
```
The following methods can then be used:

- Newton Conjugate Gradient: `'Newton-CG'`
- Newton Conjugate Gradient Trust-Region: `'trust-ncg'`
- Krylov Subspace Trust-Region: `'trust-krylov'`
- Nearly Exact Trust-Region: `'trust-exact'`
- Constrained Trust-Region: `'trust-constr'`
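As an illustration, the `minimizer_args` for one of these methods might look like the following (the method name is one of the `scipy.optimize.minimize` names listed above; the option values are examples, not recommendations):

```python
# Illustrative settings for a Hessian-based method; 'Newton-CG' is one of the
# scipy.optimize.minimize method names listed above.
minimizer_args = dict(method='Newton-CG', options={'disp': True, 'maxiter': 50})
```

The resulting optimizer is then stepped with the `Closure` instance above via `optimizer.step(closure)`.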
The code contains hacks to make it possible to call `torch.autograd.functional.hessian` (which is itself only supplied in PyTorch as a beta feature).
If using the `scipy.optimize.minimize` algorithms that don't require gradients (such as `'Nelder-Mead'`, `'COBYLA'` or `'Powell'`), ensure that `minimizer_args['jac'] = False` when instancing `MinimizeWrapper`.
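As a sketch, the `minimizer_args` for a gradient-free method would look like this (the method choice and option values are illustrative):

```python
# Gradient-free method: jac=False tells the wrapper not to supply gradients.
minimizer_args = dict(method='Nelder-Mead', jac=False, options={'maxiter': 500})
```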
Two of the algorithms I tested either didn't converge on a toy problem or hit errors. You can still select them, but they may not work:

- Truncated Newton: `'TNC'`
- Dogleg: `'dogleg'`
All the other methods that require gradients converged on a toy problem that is tested in Travis-CI.
`scipy.optimize.minimize` expects to receive a function `fun` that returns a scalar and an array of gradients the same size as the initial input array `x0`. To accommodate this, `MinimizeWrapper` does the following:

- Create a wrapper function that will be passed as `fun`
- In that function:
    - Unpack the NumPy array into parameter tensors
    - Substitute each parameter in place with these tensors
    - Evaluate `closure`, which will now use these parameter values
    - Extract the gradients
    - Pack the gradients back into one 1D NumPy array
    - Return the loss value and the gradient array

Then, all that's left is to call `scipy.optimize.minimize` and unpack the optimal parameters found back into the model.

This procedure involves unpacking and packing arrays, along with moving back and forth between NumPy and PyTorch, which may incur some overhead. I haven't done any profiling to find out whether that's likely to be a big problem; it completes in seconds when optimizing a logistic regression on MNIST by conjugate gradients.
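The steps above can be sketched in miniature. This toy (not the wrapper's actual code) stands in a single parameter tensor for the model's parameters and a quadratic for the closure's loss:

```python
import numpy as np
import torch
from scipy.optimize import minimize

# Toy stand-ins for the model parameters and the closure's loss.
param = torch.zeros(3, requires_grad=True)
target = torch.tensor([1.0, -2.0, 3.0])

def fun(x):
    # Unpack the flat NumPy array scipy passes in into the parameter tensor.
    with torch.no_grad():
        param.copy_(torch.from_numpy(x).float())
    if param.grad is not None:
        param.grad.zero_()
    # Evaluate the "closure": forward pass and backward pass.
    loss = ((param - target) ** 2).sum()
    loss.backward()
    # Pack the gradient back into a flat float64 array; return loss and gradient.
    return loss.item(), param.grad.numpy().astype(np.float64)

res = minimize(fun, np.zeros(3), jac=True, method='CG')
# res.x now holds the optimal flat parameter vector, ready to unpack
# back into the model.
```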
To evaluate the Hessian in PyTorch, `torch.autograd.functional.hessian` takes two arguments:

- `func`: a function that returns a scalar
- `inputs`: the variables to take the derivative with respect to

In most PyTorch code, `inputs` is a list of tensors embedded as parameters in the Modules that make up the `model`. They can't be passed as `inputs` because we typically don't have a `func` that will take the parameters as input, build a network from those parameters and then produce a scalar output.
From a discussion on the PyTorch forum, the only way to calculate the gradient with respect to the parameters is to monkey patch `inputs` into the model and then calculate the loss. I wrote a recursive monkey patch that operates on a `copy.deepcopy` of the original `model`. This involves copying everything in the model, so it's not very efficient.
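The recursive monkey patch might be sketched like this (a simplified illustration, not the package's actual code — `patch_parameters` is a hypothetical helper name):

```python
import copy
import torch
import torch.nn as nn

def patch_parameters(module, tensors):
    """Replace every Parameter in `module` (recursively) with plain tensors
    popped from `tensors`, in named_parameters() order, so autograd can
    differentiate a loss w.r.t. externally supplied tensors."""
    for name, _ in list(module.named_parameters(recurse=False)):
        delattr(module, name)                  # remove the registered Parameter
        setattr(module, name, tensors.pop(0))  # attach a plain tensor instead
    for child in module.children():
        patch_parameters(child, tensors)

model = nn.Linear(2, 1)
patched = copy.deepcopy(model)  # leave the original model untouched
new_tensors = [torch.ones_like(p) for p in model.parameters()]
patch_parameters(patched, list(new_tensors))
```

After patching, `patched` computes its forward pass with the substituted tensors while `model` keeps its original parameters.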
The function passed to `scipy.optimize.minimize` as `hess` does the following:

1. `copy.deepcopy` the entire `model` Module
2. The input `x` is a NumPy array, so cast it to a float32 tensor with `requires_grad`
3. Define a function `f` that:
    - Unpacks this 1D NumPy array into parameter tensors
    - Recursively navigates the module object:
        - Deleting all existing parameters
        - Replacing them with the unpacked parameters from step 2
    - Calculates the loss using the static method stored in the `closure` object
4. Pass `f` to `torch.autograd.functional.hessian` along with `x`, then cast the result back into a NumPy array
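Steps 2 and 4 can be illustrated with a stand-in loss (the real `f` would rebuild the patched model as in step 3; this sketch only shows the cast in, the Hessian call, and the cast back out):

```python
import numpy as np
import torch

def f(params):
    # Stand-in for "unpack params into the patched model and return the loss".
    return (params ** 2).sum() + params[0] * params[1]

def hess(x):
    # Step 2: cast the incoming NumPy array to a float32 tensor with requires_grad.
    xt = torch.tensor(x, dtype=torch.float32, requires_grad=True)
    # Step 4: evaluate the Hessian of f at xt and cast back to NumPy for scipy.
    H = torch.autograd.functional.hessian(f, xt)
    return H.detach().numpy().astype(np.float64)

H = hess(np.zeros(3))
```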
If you use this in your work, please cite this repository using the following BibTeX entry, along with Numpy, Scipy and PyTorch:

```
@misc{gray2021minimize,
  author = {Gray, Gavin},
  title = {PyTorch Minimize},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/gngdb/pytorch-minimize}}
}
```
This package was created with Cookiecutter and the
audreyr/cookiecutter-pypackage
project template.