- Trained WRN-28-10 with batch size 64 (128 in paper).
- Trained DenseNet-BC-100 (k=12) with batch size 32 and initial learning rate 0.05 (batch size 64 and initial learning rate 0.1 in paper).
- Trained ResNeXt-29 4x64d on a single GPU with batch size 32 and initial learning rate 0.025 (8 GPUs, batch size 128, and initial learning rate 0.1 in paper).
- Trained shake-shake models on a single GPU (2 GPUs in paper).
- Trained shake-shake-26 2x64d (S-S-I) with batch size 64 and initial learning rate 0.1.
- Test errors reported above are those at the last epoch.
- Experiments with only 1 run were done on a different computer from the one used for the experiments with 3 runs.
- Results reported in the tables are the test errors at the last epoch.
- All models are trained using cosine annealing with an initial learning rate of 0.2.
- The following data augmentations are applied to the training data (a minimal sketch follows this list):
  - Images are padded with 4 pixels on each side, and 28x28 patches are randomly cropped from the padded images.
  - Images are randomly flipped horizontally.
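The augmentation pipeline described above maps directly onto standard torchvision transforms. The snippet below is only a minimal sketch assuming torchvision's API; the exact transforms and parameters used in this repository's configs may differ.

```python
from torchvision import transforms

# Minimal sketch of the augmentation pipeline described above (assumption:
# standard torchvision transforms; the repository's actual config may differ).
train_transform = transforms.Compose([
    transforms.RandomCrop(28, padding=4),   # pad 4 px on each side, crop 28x28
    transforms.RandomHorizontalFlip(),      # random horizontal flip
    transforms.ToTensor(),
])
```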
## Results on MNIST
| Model | Test Error (%, median of 3 runs) | # of Epochs | Training Time |
|---|---|---|---|
| ResNet-preact-20 | 0.38 | 40 | 9m |
| ResNet-preact-20, Cutout 6 | 0.40 | 40 | 9m |
| ResNet-preact-20, Cutout 8 | 0.32 | 40 | 9m |
| ResNet-preact-20, Cutout 10 | 0.34 | 40 | 9m |
| ResNet-preact-20, Cutout 12 | 0.30 | 40 | 9m |
| ResNet-preact-20, Cutout 14 | 0.34 | 40 | 9m |
| ResNet-preact-20, Cutout 16 | 0.35 | 40 | 9m |
| ResNet-preact-20, RandomErasing | 0.36 | 40 | 9m |
| ResNet-preact-20, Mixup (alpha=1) | 0.39 | 40 | 11m |
| ResNet-preact-20, Mixup (alpha=1) | 0.37 | 80 | 21m |
| ResNet-preact-20, Mixup (alpha=0.5) | 0.33 | 40 | 11m |
| ResNet-preact-20, Mixup (alpha=0.5) | 0.38 | 80 | 21m |
| ResNet-preact-20, widening factor 4, Cutout 12 | 0.29 | 40 | 40m |
| ResNet-preact-50 | 0.39 | 40 | 22m |
| ResNet-preact-50, Cutout 12 | 0.31 | 40 | 22m |
| ResNet-preact-50, widening factor 4, Cutout 12 | 0.29 (1 run) | 40 | 1h40m |
| shake-shake-26 2x32d (S-S-I), Cutout 12 | 0.29 | 100 | 1h48m |
#### Note

- Results reported in the table are the test errors at the last epoch.
- All models are trained using cosine annealing with an initial learning rate of 0.2.
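Several rows in the table above vary the Cutout patch size; the number after "Cutout" is presumably the side length (in pixels) of the square patch that is zeroed out. Below is a minimal sketch of such a transform, assuming a tensor-based implementation; the class name and details are illustrative, not the repository's code.

```python
import torch

class Cutout:
    """Illustrative Cutout transform: zero out one random square patch.
    (Hypothetical name/details; assumes the number after "Cutout" in the
    table is the patch side length in pixels.)"""

    def __init__(self, size):
        self.size = size

    def __call__(self, img):
        # img: tensor of shape (C, H, W), e.g. produced by transforms.ToTensor()
        _, h, w = img.shape
        cy = torch.randint(h, (1,)).item()   # patch center, sampled uniformly
        cx = torch.randint(w, (1,)).item()
        y1, y2 = max(cy - self.size // 2, 0), min(cy + self.size // 2, h)
        x1, x2 = max(cx - self.size // 2, 0), min(cx + self.size // 2, w)
        img[:, y1:y2, x1:x2] = 0.0           # zero out the (clipped) patch
        return img
```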
## Experiments

### Experiment on residual units, learning rate scheduling, and data augmentation
In this experiment, the effects of the following on classification accuracy are investigated:

- PyramidNet-like residual units
- Cosine annealing of the learning rate
- Cutout
- Random Erasing
- Mixup
- Preactivation of shortcuts after downsampling
In this experiment, ResNet-preact-56 is trained on CIFAR-10 with an initial learning rate of 0.2.
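As a rough illustration of the "w/o 1st ReLU, w/ last BN" variant examined below, here is a minimal sketch of a PyramidNet-like basic residual unit (BN-Conv-BN-ReLU-Conv-BN) as a standalone PyTorch module; the class name, layer layout, and hyperparameters are assumptions for illustration, not the repository's actual implementation.

```python
import torch.nn as nn

class PyramidNetLikeBasicBlock(nn.Module):
    """Sketch of a residual unit without the first ReLU and with a BN after
    the last convolution (BN-Conv-BN-ReLU-Conv-BN). Names and structure are
    illustrative assumptions, not the repository's actual code."""

    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)    # no ReLU before the first conv
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)    # extra BN after the last conv

    def forward(self, x):
        out = self.conv1(self.bn1(x))
        out = self.conv2(self.relu(self.bn2(out)))
        out = self.bn3(out)
        return out + x                         # identity shortcut
```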
#### Note

- The PyramidNet paper (arXiv:1610.02915) showed that removing the first ReLU in residual units and adding BN after the last convolution in residual units both improve classification accuracy.
- The SGDR paper (arXiv:1608.03983) showed that cosine annealing improves classification accuracy even without restarts.
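Below is a minimal sketch of cosine annealing without restarts in PyTorch, assuming the built-in `CosineAnnealingLR` scheduler; the model, epoch count, and other optimizer settings are placeholders, not the exact ones used here.

```python
import torch

# Sketch only: placeholder model and epoch count, assuming torch.optim's
# built-in CosineAnnealingLR (cosine annealing without restarts).
model = torch.nn.Linear(10, 10)        # stand-in for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=1e-4)
num_epochs = 160                       # assumed epoch count for illustration
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... one epoch of training ...
    scheduler.step()                   # anneal the learning rate once per epoch
```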
#### Results

- PyramidNet-like units work.
- It might be better not to preactivate shortcuts after downsampling when using PyramidNet-like units.
- Cosine annealing slightly improves accuracy.
- Cutout, RandomErasing, and Mixup all work well.
- Mixup needs longer training (see the Mixup sketch after this list).
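The following is a minimal sketch of a Mixup training step, mixing inputs and taking a convex combination of the two losses; the helper name and details are illustrative assumptions in plain PyTorch, not the repository's code.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_step(model, x, y, alpha=1.0):
    """Illustrative Mixup training step (hypothetical helper, not the repo's code)."""
    lam = np.random.beta(alpha, alpha)                      # mixing coefficient
    index = torch.randperm(x.size(0), device=x.device)      # pair each sample with a random one
    mixed_x = lam * x + (1 - lam) * x[index]                 # convex combination of inputs
    logits = model(mixed_x)
    # Equivalent to cross-entropy against the mixed one-hot targets.
    loss = lam * F.cross_entropy(logits, y) + (1 - lam) * F.cross_entropy(logits, y[index])
    return loss
```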
| Model | Test Error (%, median of 5 runs) | Training Time |
|---|---|---|
| w/ 1st ReLU, w/o last BN, preactivate shortcut after downsampling | 6.45 | 95 min |
| w/ 1st ReLU, w/o last BN | 6.47 | 95 min |
| w/o 1st ReLU, w/o last BN | 6.14 | 89 min |
| w/ 1st ReLU, w/ last BN | 6.43 | 104 min |
| w/o 1st ReLU, w/ last BN | 5.85 | 98 min |
| w/o 1st ReLU, w/ last BN, preactivate shortcut after downsampling | | |
## References

- He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. arXiv:1512.03385
- He, Kaiming, et al. "Identity mappings in deep residual networks." European Conference on Computer Vision (ECCV). Springer International Publishing, 2016. arXiv:1603.05027, Torch implementation
- Zagoruyko, Sergey, and Nikos Komodakis. "Wide residual networks." In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1-87.12. BMVA Press, September 2016. arXiv:1605.07146, Torch implementation
- Loshchilov, Ilya, and Frank Hutter. "SGDR: Stochastic gradient descent with warm restarts." In International Conference on Learning Representations (ICLR), 2017. arXiv:1608.03983, Lasagne implementation
- Huang, Gao, et al. "Densely connected convolutional networks." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700-4708. arXiv:1608.06993, Torch implementation
- Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1492-1500. arXiv:1611.05431, Torch implementation
- Gastaldi, Xavier. "Shake-Shake regularization." In International Conference on Learning Representations (ICLR), 2017. arXiv:1705.07485, Torch implementation
- DeVries, Terrance, and Graham W. Taylor. "Improved regularization of convolutional neural networks with Cutout." arXiv preprint arXiv:1708.04552, 2017. arXiv:1708.04552, PyTorch implementation