A wrapper for `scipy.optimize.minimize` that makes it a PyTorch Optimizer, providing Conjugate Gradients, BFGS, L-BFGS, SLSQP, Newton Conjugate Gradient, Trust Region methods and others in PyTorch.
Warning: this project is a proof of concept and is not necessarily reliable, although the code (and that's all of it) is small enough to be readable.
Dependencies:

- pytorch
- scipy

The following install procedure isn't going to check these are installed.
This package can be installed with pip, directly from GitHub:

```
python -m pip install git+https://github.com/gngdb/pytorch-minimize.git
```

Or by cloning the repository and then installing:

```
git clone https://github.com/gngdb/pytorch-minimize.git
cd pytorch-minimize
python -m pip install .
```
The Optimizer class is `MinimizeWrapper` in `pytorch_minimize.optim`. It has the same interface as a PyTorch Optimizer, taking `model.parameters()`, and is configured by passing a dictionary of arguments, here called `minimizer_args`, that will later be passed to `scipy.optimize.minimize`:

```python
from pytorch_minimize.optim import MinimizeWrapper
minimizer_args = dict(method='CG', options={'disp': True, 'maxiter': 100})
optimizer = MinimizeWrapper(model.parameters(), minimizer_args)
```
The main difference from most PyTorch optimizers is that a closure must be defined (`torch.optim.LBFGS` also requires this):

```python
def closure():
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    return loss

optimizer.step(closure)
```
This optimizer is intended for deterministic optimisation problems, such as full-batch learning problems, so `optimizer.step(closure)` should only need to be called once.

Can `.step(closure)` be called more than once? Technically yes, but it shouldn't be necessary, because multiple steps are run internally up to the `maxiter` option in `minimizer_args`, and multiple calls are not recommended: each call to `optimizer.step(closure)` is an independent evaluation of `scipy.optimize.minimize`, so the internal state of the optimization algorithm is discarded between calls.
Using PyTorch to calculate the Jacobian, the following algorithms are supported (the method name string on the right corresponds to the names used by `scipy.optimize.minimize`):

- Conjugate Gradients: `'CG'`
- Broyden-Fletcher-Goldfarb-Shanno (BFGS): `'BFGS'`
- Limited-memory BFGS: `'L-BFGS-B'`
- Sequential Least Squares Programming: `'SLSQP'`
Warning: this is experimental and probably unpredictable.

To use the methods that require evaluating the Hessian, a `Closure` object with the following methods is required (full MNIST example here):

```python
class Closure():
    def __init__(self, model):
        self.model = model

    @staticmethod
    def loss(model):
        output = model(data)
        return loss_fn(output, target)

    def __call__(self):
        optimizer.zero_grad()
        loss = self.loss(self.model)
        loss.backward()
        return loss

closure = Closure(model)
```
The following methods can then be used:

- Newton Conjugate Gradient: `'Newton-CG'`
- Newton Conjugate Gradient Trust-Region: `'trust-ncg'`
- Krylov Subspace Trust-Region: `'trust-krylov'`
- Nearly Exact Trust-Region: `'trust-exact'`
- Constrained Trust-Region: `'trust-constr'`
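As an illustration, the `minimizer_args` for one of these methods might look like the following (the method name is one of the `scipy.optimize.minimize` names listed above; the option values are examples, not recommendations):

```python
# Illustrative settings for a Hessian-based method; 'Newton-CG' is one of the
# scipy.optimize.minimize method names listed above.
minimizer_args = dict(method='Newton-CG', options={'disp': True, 'maxiter': 50})
```

The resulting optimizer is then stepped with the `Closure` instance above via `optimizer.step(closure)`.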
The code contains hacks to make it possible to call `torch.autograd.functional.hessian` (which is itself only supplied in PyTorch as a beta feature).
If using the `scipy.optimize.minimize` algorithms that don't require gradients (such as `'Nelder-Mead'`, `'COBYLA'` or `'Powell'`), ensure that `minimizer_args['jac'] = False` when instancing `MinimizeWrapper`.
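As a sketch, the `minimizer_args` for a gradient-free method would look like this (the method choice and option values are illustrative):

```python
# Gradient-free method: jac=False tells the wrapper not to supply gradients.
minimizer_args = dict(method='Nelder-Mead', jac=False, options={'maxiter': 500})
```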
Two of the algorithms I tested either didn't converge on a toy problem or hit errors. You can still select them, but they may not work:

- Truncated Newton: `'TNC'`
- Dogleg: `'dogleg'`
All the other methods that require gradients converged on a toy problem that is tested in Travis-CI.
`scipy.optimize.minimize` expects to receive a function `fun` that returns a scalar and an array of gradients the same size as the initial input array `x0`. To accommodate this, `MinimizeWrapper` does the following:

- Create a wrapper function that will be passed as `fun`
- In that function:
    - Unpack the NumPy array into parameter tensors
    - Substitute each parameter in place with these tensors
    - Evaluate `closure`, which will now use these parameter values
    - Extract the gradients
    - Pack the gradients back into one 1D NumPy array
    - Return the loss value and the gradient array

Then, all that's left is to call `scipy.optimize.minimize` and unpack the optimal parameters found back into the model.

This procedure involves unpacking and packing arrays, along with moving back and forth between NumPy and PyTorch, which may incur some overhead. I haven't done any profiling to find out whether that's likely to be a big problem; it completes in seconds when optimizing a logistic regression on MNIST by conjugate gradients.
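The steps above can be sketched in miniature. This toy (not the wrapper's actual code) stands in a single parameter tensor for the model's parameters and a quadratic for the closure's loss:

```python
import numpy as np
import torch
from scipy.optimize import minimize

# Toy stand-ins for the model parameters and the closure's loss.
param = torch.zeros(3, requires_grad=True)
target = torch.tensor([1.0, -2.0, 3.0])

def fun(x):
    # Unpack the flat NumPy array scipy passes in into the parameter tensor.
    with torch.no_grad():
        param.copy_(torch.from_numpy(x).float())
    if param.grad is not None:
        param.grad.zero_()
    # Evaluate the "closure": forward pass and backward pass.
    loss = ((param - target) ** 2).sum()
    loss.backward()
    # Pack the gradient back into a flat float64 array; return loss and gradient.
    return loss.item(), param.grad.numpy().astype(np.float64)

res = minimize(fun, np.zeros(3), jac=True, method='CG')
# res.x now holds the optimal flat parameter vector, ready to unpack
# back into the model.
```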
To evaluate the Hessian in PyTorch, `torch.autograd.functional.hessian` takes two arguments:

- `func`: a function that returns a scalar
- `inputs`: the variables to take the derivative with respect to

In most PyTorch code, `inputs` is a list of tensors embedded as parameters in the Modules that make up the `model`. They can't be passed as `inputs` because we typically don't have a `func` that will take the parameters as input, build a network from those parameters and then produce a scalar output.
From a discussion on the PyTorch forum, the only way to calculate the gradient with respect to the parameters is to monkey patch `inputs` into the model and then calculate the loss. I wrote a recursive monkey patch that operates on a `copy.deepcopy` of the original `model`. This involves copying everything in the model, so it's not very efficient.
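The recursive monkey patch might be sketched like this (a simplified illustration, not the package's actual code — `patch_parameters` is a hypothetical helper name):

```python
import copy
import torch
import torch.nn as nn

def patch_parameters(module, tensors):
    """Replace every Parameter in `module` (recursively) with plain tensors
    popped from `tensors`, in named_parameters() order, so autograd can
    differentiate a loss w.r.t. externally supplied tensors."""
    for name, _ in list(module.named_parameters(recurse=False)):
        delattr(module, name)                  # remove the registered Parameter
        setattr(module, name, tensors.pop(0))  # attach a plain tensor instead
    for child in module.children():
        patch_parameters(child, tensors)

model = nn.Linear(2, 1)
patched = copy.deepcopy(model)  # leave the original model untouched
new_tensors = [torch.ones_like(p) for p in model.parameters()]
patch_parameters(patched, list(new_tensors))
```

After patching, `patched` computes its forward pass with the substituted tensors while `model` keeps its original parameters.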
The function passed to `scipy.optimize.minimize` as `hess` does the following:

1. `copy.deepcopy` the entire `model` Module
2. The input `x` is a NumPy array, so cast it to a float32 tensor with `requires_grad`
3. Define a function `f` that:
    - Unpacks this 1D NumPy array into parameter tensors
    - Recursively navigates the module object:
        - Deleting all existing parameters
        - Replacing them with the unpacked parameters from step 2
    - Calculates the loss using the static method stored in the `closure` object
4. Pass `f` to `torch.autograd.functional.hessian` along with `x`, then cast the result back into a NumPy array
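Steps 2 and 4 can be illustrated with a stand-in loss (the real `f` would rebuild the patched model as in step 3; this sketch only shows the cast in, the Hessian call, and the cast back out):

```python
import numpy as np
import torch

def f(params):
    # Stand-in for "unpack params into the patched model and return the loss".
    return (params ** 2).sum() + params[0] * params[1]

def hess(x):
    # Step 2: cast the incoming NumPy array to a float32 tensor with requires_grad.
    xt = torch.tensor(x, dtype=torch.float32, requires_grad=True)
    # Step 4: evaluate the Hessian of f at xt and cast back to NumPy for scipy.
    H = torch.autograd.functional.hessian(f, xt)
    return H.detach().numpy().astype(np.float64)

H = hess(np.zeros(3))
```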
If you use this in your work, please cite this repository using the following BibTeX entry, along with Numpy, Scipy and PyTorch:

```
@misc{gray2021minimize,
  author = {Gray, Gavin},
  title = {PyTorch Minimize},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/gngdb/pytorch-minimize}}
}
```
This package was created with Cookiecutter and the
audreyr/cookiecutter-pypackage
project template.