Based on the paper: https://arxiv.org/abs/2306.00144
Be aware that all experiments reported in the paper were run using the JAX version of mechanic, which is available in optax via optax.contrib.mechanize.
Mechanic aims to remove the need for tuning a learning rate scalar (i.e. the maximum learning rate in a schedule). You can use it with any pytorch optimizer and schedule. Simply replace:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
with:
from mechanic_pytorch import mechanize
optimizer = mechanize(torch.optim.SGD)(model.parameters(), lr=1.0)
# You can set the lr to anything here.
# However, excessively small values may cause numerical precision issues.
# Mechanic's scale factor will be multiplied by the base optimizer's learning rate.
That's it! The new optimizer should no longer require tuning the learning rate scale! That is, the optimizer should now be very robust to heavily mis-specified values of lr.
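To see why the choice of lr matters so little, recall that mechanic's scale factor is multiplied by the base optimizer's learning rate. The sketch below is only the arithmetic of that product, not mechanic's actual update rule: the function name required_scale and the target value 0.01 are hypothetical, chosen purely for illustration.

```python
def required_scale(target_effective_lr: float, base_lr: float) -> float:
    """Scale factor s for which s * base_lr equals the target effective rate.

    Illustration only: mechanic learns its scale factor online, it does not
    compute it from a known target like this.
    """
    if base_lr <= 0.0:
        raise ValueError("base_lr must be positive")
    return target_effective_lr / base_lr


# Hypothetical "good" effective learning rate for some problem.
target = 0.01

for base_lr in (1e-4, 1e-2, 1.0, 100.0):
    s = required_scale(target, base_lr)
    # The product s * base_lr is the same for every base_lr;
    # only the scale factor has to change.
    print(f"lr={base_lr:g}  s={s:g}  effective={s * base_lr:g}")
```

A mis-specified lr only changes the scale factor needed to reach the same effective rate. This also suggests why excessively small values of lr can be a problem: the scale factor would have to grow very large to compensate.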
pip install mechanic-pytorch
Note that the package name is mechanic-pytorch, but you should import mechanic_pytorch (dash replaced with underscore).
It is possible to play with the configuration of mechanic, although this should be unnecessary:
optimizer = mechanize(torch.optim.SGD, s_decay=0.0, betas=(0.999,0.999999), store_delta=False)(model.parameters(), lr=0.01)
- The option store_delta=False is set to minimize memory usage. At minimum, we currently keep one extra "slot" of memory (i.e. an extra copy of the weights). If you are ok keeping one more copy, you can set store_delta=True. This will make the first few iterations have a slightly more accurate update, and usually has negligible effect.
- The option s_decay is a bit like a weight-decay term that empirically is helpful for smaller datasets. We use a default of 0.01 in all our experiments. For larger datasets, smaller values (even 0.0) often worked as well.
- The option betas is a list of exponential weighting factors used internally in mechanic. They are NOT related to beta values found in Adam. In theory, it should be safe to provide a large list of possibilities here. The default settings of (0.9,0.99,0.999,0.9999,0.99999,0.999999) seem to work well in a range of tasks.
- The option s_init is the initial value for the mechanic learning rate. It should be an underestimate of the correct learning rate, and it can safely be set to a very small value (default 1e-8), although it cannot be set to zero. In particular, the theoretical analysis of mechanic includes a log(1/s_init) term. This is very robust to small values, but will eventually blow up if you make s_init absurdly small.
- You can customize the logging behavior by setting the log_func argument. This enables logging of the scale factors mechanic is using. Note that mechanic produces a scale factor s that is multiplied by the base optimizer's update. So, if the base optimizer has a learning rate that is different from 1, the value of s should be multiplied by that base optimizer's learning rate in order to find the effective learning rate that is being applied.
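To get a feel for how forgiving the log(1/s_init) term from the s_init option is, it can be tabulated directly. This is a minimal sketch of that one term only, not of mechanic's full analysis; the helper name s_init_penalty is hypothetical.

```python
import math


def s_init_penalty(s_init: float) -> float:
    """Return log(1 / s_init), the term that appears in mechanic's analysis."""
    if s_init <= 0.0:
        # log(1 / 0) is undefined, matching the rule that s_init cannot be zero.
        raise ValueError("s_init must be strictly positive")
    return math.log(1.0 / s_init)


for s_init in (1e-2, 1e-8, 1e-16, 1e-100):
    print(f"s_init={s_init:g}  log(1/s_init)={s_init_penalty(s_init):.1f}")
```

The term grows only logarithmically: each factor of 10 removed from s_init adds about 2.3 to it, so going from 1e-2 to the default 1e-8 makes it roughly four times larger rather than a million times larger.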
mechanic is distributed under the terms of the MIT license.