The PyTorch implementation of Stable Weight Decay. The algorithms are proposed in the paper "Stable Weight Decay Regularization".
We propose the Stable Weight Decay (SWD) method to fix weight decay in modern deep learning libraries.

- SWD usually yields significant improvements over both L2 regularization and decoupled weight decay.
- Simply fixing weight decay in Adam via SWD, with no extra hyperparameter, usually outperforms complex Adam variants that have more hyperparameters.
- SGD with Stable Weight Decay (SGDS) also often outperforms SGD with L2 regularization.
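To make the idea concrete, here is a minimal numpy sketch of a single AdamS-style update. The Adam moment updates are standard; the only change is that the weight-decay term is rescaled by the global mean of the second-moment scale, which is the gist of stable weight decay. The function name and the exact normalization used here are illustrative assumptions; see the paper for the precise update rule.

```python
import numpy as np

def adams_like_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                    eps=1e-8, weight_decay=5e-4):
    """One illustrative AdamS-style step (sketch, not the reference code).

    Standard Adam moment estimates, but the weight-decay term is divided
    by the global mean of sqrt(v_hat), so that the effective decay
    strength stays stable as the adaptive step size changes.
    """
    beta1, beta2 = betas
    # Standard Adam first/second moment updates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Stable weight decay: normalize the decay by the mean adaptive scale
    # (an assumed form of the paper's normalization).
    v_bar = np.sqrt(v_hat).mean()
    theta = (theta
             - lr * m_hat / (np.sqrt(v_hat) + eps)
             - lr * weight_decay * theta / v_bar)
    return theta, m, v
```

With plain decoupled weight decay (AdamW), the decay term would be `lr * weight_decay * theta` with no `v_bar` normalization; the division by the mean scale is what distinguishes the stable variant in this sketch.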
- Python 3.7.3
- PyTorch >= 1.4.0
You may use it as a standard PyTorch optimizer.

```python
import swd_optim

optimizer = swd_optim.AdamS(net.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-08, weight_decay=5e-4, amsgrad=False)
```
Test error (%, mean±std):

Dataset | Model | AdamS | SGD M | Adam | AMSGrad | AdamW | AdaBound | Padam | Yogi | RAdam
---|---|---|---|---|---|---|---|---|---|---
CIFAR-10 | ResNet18 | 4.91±0.04 | 5.01±0.03 | 6.53±0.03 | 6.16±0.18 | 5.08±0.07 | 5.65±0.08 | 5.12±0.04 | 5.87±0.12 | 6.01±0.10
CIFAR-10 | VGG16 | 6.09±0.11 | 6.42±0.02 | 7.31±0.25 | 7.14±0.14 | 6.48±0.13 | 6.76±0.12 | 6.15±0.06 | 6.90±0.22 | 6.56±0.04
CIFAR-100 | DenseNet121 | 20.52±0.26 | 19.81±0.33 | 25.11±0.15 | 24.43±0.09 | 21.55±0.14 | 22.69±0.15 | 21.10±0.23 | 22.15±0.36 | 22.27±0.22
CIFAR-100 | GoogLeNet | 21.05±0.18 | 21.21±0.29 | 26.12±0.33 | 25.53±0.17 | 21.29±0.17 | 23.18±0.31 | 21.82±0.17 | 24.24±0.16 | 22.23±0.15
If you use Stable Weight Decay in your work, please cite "Stable Weight Decay Regularization".
```
@article{xie2020stable,
  title={Stable Weight Decay Regularization},
  author={Xie, Zeke and Sato, Issei and Sugiyama, Masashi},
  journal={arXiv preprint arXiv:2011.11152},
  year={2020}
}
```