This repository contains the implementation of the `AdamWL2SP` optimizer, as described in this blog post. `AdamWL2SP` is the adaptive moment estimation (Adam) optimizer with decoupled weight decay and L2-SP regularization, an L2 penalty that pulls the fine-tuned weights toward their pre-trained starting point.
The optimizer is implemented in `src/transformers_fine_tuning/optim/adamwl2sp.py` and is based on the PyTorch implementation of AdamW.
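For intuition, the following is a minimal sketch of the kind of update step such an optimizer performs: a standard Adam step plus two decoupled decay terms, one toward zero (as in AdamW) and one toward the pre-trained starting point (the L2-SP term). The function name, argument names, and exact formulation are illustrative assumptions; see `adamwl2sp.py` for the actual implementation.

```python
import torch

def adamw_l2sp_step(param, grad, exp_avg, exp_avg_sq, step, sp_param,
                    lr=1e-5, betas=(0.9, 0.999), eps=1e-8,
                    weight_decay=0.0, sp_weight_decay=0.0):
    """Illustrative single-parameter update (not the actual implementation):
    an Adam step with decoupled decay toward zero and toward `sp_param`."""
    beta1, beta2 = betas
    # Update biased first- and second-moment estimates.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_correction2).sqrt().add_(eps)
    # Decoupled weight decay toward zero, as in AdamW.
    param.mul_(1 - lr * weight_decay)
    # Decoupled L2-SP term: decay toward the pre-trained starting point.
    param.add_(sp_param - param, alpha=lr * sp_weight_decay)
    # Adam gradient step.
    param.addcdiv_(exp_avg / bias_correction1, denom, value=-lr)
    return param
```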
`src/transformers_fine_tuning/transformers/trainer_optimizer_init.py` contains a subclass of `Trainer` from the 🤗 transformers library that facilitates custom optimizers, such as our `AdamWL2SP`. It is not strictly necessary, but we prefer this design to passing the optimizer to `Trainer` directly. (For further background, see the transformers issue "Passing optimizer to Trainer constructor does not work" #18635.)
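As a rough sketch of this approach (the class and argument names below are assumptions, not the actual contents of `trainer_optimizer_init.py`), a `Trainer` subclass can defer optimizer construction to a user-supplied callable, so the optimizer is only built when the `Trainer` itself sets up training:

```python
from transformers import Trainer

class TrainerWithOptimizerInit(Trainer):
    """Illustrative Trainer subclass that builds its optimizer from a
    callable; the class and argument names here are hypothetical."""

    def __init__(self, *args, optimizer_init=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.optimizer_init = optimizer_init

    def create_optimizer(self):
        # Trainer calls create_optimizer when training starts, so the
        # optimizer is constructed under the Trainer's control rather
        # than being passed to the constructor (see issue #18635 above).
        if self.optimizer is None and self.optimizer_init is not None:
            self.optimizer = self.optimizer_init(self.model)
        return super().create_optimizer()
```

With this shape, a script can pass something like `optimizer_init=lambda model: AdamWL2SP(model.parameters(), ...)` without modifying the rest of the training setup.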
The example script `fine-tune.py` demonstrates using our code to fine-tune ALBERT on the RTE task, using optimizers such as `AdamW` from `torch` or our custom `AdamWL2SP` optimizer. The hyperparameters are set to the same values used in our experiments. The model, optimizer, task, random seeds, and hyperparameters can be modified by setting the appropriate global variables in the script.
It will work on CPU, on GPU via `cuda`, and on TPU via `torch_xla`, with optional concurrency if multiple TPU cores are available.
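As a rough illustration of how the pieces might be wired together in such a script, the sketch below loads ALBERT for RTE and chooses between the two optimizers. The checkpoint name, learning rates, and in particular the `AdamWL2SP` import path and constructor arguments (`sp_params` is a hypothetical name) are assumptions; `fine-tune.py` and `adamwl2sp.py` are authoritative.

```python
import torch
from transformers import AlbertForSequenceClassification

# Assumed import path based on the repository layout; the class name
# and constructor arguments below are hypothetical.
from transformers_fine_tuning.optim.adamwl2sp import AdamWL2SP

# Pre-trained ALBERT for RTE (two-way classification).
model = AlbertForSequenceClassification.from_pretrained("albert-base-v2", num_labels=2)

# Detached copy of the pre-trained weights to serve as the L2-SP starting point.
starting_point = [p.detach().clone() for p in model.parameters()]

USE_CUSTOM_OPTIMIZER = True  # flip to use torch's AdamW instead

if USE_CUSTOM_OPTIMIZER:
    optimizer = AdamWL2SP(model.parameters(), sp_params=starting_point, lr=2e-5)
else:
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```

The resulting optimizer can then be supplied to the `Trainer` subclass described above.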
Note that this is not the actual script used to run our experiments, which performed additional tracking of the metrics. For given seeds, the results will not reproduce those reported in the blog post; however, a series of fine-tuning runs with `fine-tune.py` should produce qualitatively similar results.
The Python module itself, `transformers_fine_tuning`, is in the subdirectory `src/transformers_fine_tuning`.
To prepare an environment for running the example script `fine-tune.py`, clone this repository and run `source setup.sh` in the console, which installs the dependencies and sets up the environment variables.
For fine-tuning without concurrency, simply run the following in the console:

`python fine-tune.py`

It will automatically use a TPU if one is available.
For multi-core TPU environments, concurrent training can be run as follows (for example, with 8 cores available):

`python transformers/examples/pytorch/xla_spawn.py --num_cores 8 fine-tune.py`
For fixed random seeds, concurrent training will not replicate a non-concurrent training run.