
Layer-wise Adaptive Rate Scaling in PyTorch




This repo contains a PyTorch implementation of layer-wise adaptive rate scaling (LARS) from the paper "Large Batch Training of Convolutional Networks" by You, Gitman, and Ginsburg.
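As a rough illustration of what LARS does, the sketch below follows the update rule described in the paper: each layer gets a local learning rate proportional to the ratio of its weight norm to its gradient norm. It is written in plain Python so the arithmetic is visible, and it is not the code from this repo. The names `lars_local_lr` and `lars_step` are hypothetical, and the fallback used when a norm is zero is an assumption.

```python
import math


def lars_local_lr(weights, grads, eta=0.001, weight_decay=5e-4):
    """Local learning rate for one layer: eta * ||w|| / (||g|| + wd * ||w||)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm == 0.0 or g_norm == 0.0:
        # Assumption: fall back to an unscaled update when the ratio is undefined.
        return 1.0
    return eta * w_norm / (g_norm + weight_decay * w_norm)


def lars_step(weights, grads, velocity, global_lr=0.1, momentum=0.9,
              eta=0.001, weight_decay=5e-4):
    """One momentum update for a single layer, scaled by the local learning rate."""
    local_lr = lars_local_lr(weights, grads, eta, weight_decay)
    new_velocity = [
        momentum * v + global_lr * local_lr * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w - v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

With weight decay set to zero, multiplying a layer's gradient by a constant leaves the step unchanged, because the local learning rate shrinks by the same factor. This is what keeps the update size proportional to the weight norm regardless of the gradient scale.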

To train with the LARS optimizer, run

python train.py --optimizer LARS --cuda lars_results

It uses skeletor-ml for experiment logging.

Preliminary results

So far I have only tested this with a ResNet18 on CIFAR-10. To train with very large batch sizes, I used standard gradient accumulation: gradients from several smaller mini-batches are accumulated before each optimizer step.
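The gradient accumulation idea can be sketched on a toy problem. The example below is a minimal plain-Python illustration, not the training loop from this repo; it assumes a 1-D linear model with mean squared error, and the helper names `grad_mse` and `accumulated_grad` are hypothetical. It shows that averaging the gradients of equally sized micro-batches reproduces the gradient of the full large batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients so the result matches one large batch."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch size must be a multiple of micro_batch_size")
    num_chunks = len(xs) // micro_batch_size
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        stop = start + micro_batch_size
        chunk_grad = grad_mse(w, xs[start:stop], ys[start:stop])
        # Dividing by the number of chunks turns the running sum into a mean.
        total += chunk_grad / num_chunks
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 9.0]
full = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
```

In PyTorch the same effect is usually obtained by calling `backward()` on each micro-batch loss divided by the number of micro-batches, and calling `optimizer.step()` and `optimizer.zero_grad()` only once per large batch.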


Batch Size    Test Accuracy (%)
64            89.39
256           85.45
1024          81.2
4096          73.41
16384         64.13

For comparison, SGD with momentum reaches about 93.5% test accuracy in 200 epochs with a geometric decay schedule (using this implementation). I have not done extensive hyperparameter tuning for LARS, though; I used the default parameters suggested by the paper: a base learning rate of 0.1, 200 epochs, eta (the LARS trust coefficient) of 0.001, momentum of 0.9, weight decay of 5e-4, and the polynomial learning rate decay schedule.
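For reference, a polynomial decay schedule can be sketched as below. This is a minimal illustration rather than the schedule code from this repo: the function name `polynomial_lr` is hypothetical, and the default `power=2.0` and `end_lr=0.0` are assumptions (the paper reports using a power of 2), so the values actually used here may differ.

```python
def polynomial_lr(base_lr, step, total_steps, power=2.0, end_lr=0.0):
    """Decay from base_lr to end_lr following (1 - step / total_steps) ** power."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps past the end keep returning end_lr.
    frac = min(max(step, 0), total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr


# Learning rate at the start, midpoint, and end of a 200-epoch run.
schedule = [polynomial_lr(0.1, epoch, 200) for epoch in (0, 100, 200)]
```

With these assumed defaults, the rate starts at the base learning rate of 0.1, falls to a quarter of it halfway through training, and reaches zero at the final epoch.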