Bag of Tricks for Image Classification with Convolutional Neural Networks

This repo was inspired by the paper Bag of Tricks for Image Classification with Convolutional Neural Networks.

I will test as many popular training tricks as I can for improving image classification accuracy. Feel free to leave a comment about the tricks you want me to test (please include the referenced paper along with each trick).

hardware

The experiments run on 4 Tesla P40 GPUs.

dataset

I use the CUB_200_2011 dataset instead of ImageNet, just for simplicity. It is a fine-grained image classification dataset that contains 200 bird categories, 5K+ training images, and 5K+ test images. The state-of-the-art accuracy with VGG16 is around 73% (please correct me if I am wrong). You can easily switch to another dataset you like, such as Stanford Dogs or Stanford Cars, or even ImageNet.

network

I use a VGG16 network to test the tricks, also for simplicity, since VGG16 is easy to implement. I'm considering switching to AlexNet to see how powerful these tricks are.

tricks

Tricks I've tested so far; some of them are from the paper Bag of Tricks for Image Classification with Convolutional Neural Networks:

trick	referenced paper
xavier init	Understanding the difficulty of training deep feedforward neural networks
warmup training	Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
no bias decay	Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes
label smoothing	Rethinking the Inception Architecture for Computer Vision
random erasing	Random Erasing Data Augmentation
cutout	Improved Regularization of Convolutional Neural Networks with Cutout
linear scaling learning rate	Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
cosine learning rate decay	SGDR: Stochastic Gradient Descent with Warm Restarts
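
As a rough illustration of two of the learning rate tricks listed above, here is a minimal sketch of a schedule that combines linear warmup with cosine decay (a single cosine cycle, without the restarts described in SGDR). The function name `learning_rate`, the step-based formulation, and the warmup length are assumptions made for this example and are not taken from this repo's code.

```python
import math


def learning_rate(step, total_steps, base_lr, warmup_steps):
    """Hypothetical helper: linear warmup followed by cosine decay.

    step         -- current training step, starting at 0
    total_steps  -- total number of training steps
    base_lr      -- peak learning rate reached at the end of warmup
    warmup_steps -- number of steps used to ramp up (assumed value)
    """
    if step < warmup_steps:
        # Linear warmup: grow from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print a few points of an example schedule (values are illustrative).
    for s in (0, 5, 9, 10, 55, 100):
        print(s, learning_rate(s, total_steps=100, base_lr=0.04, warmup_steps=10))
```

In a real training loop the returned value would be written into the optimizer before each step; whether to update per step or per epoch is a design choice this sketch leaves open.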

More to come.

result

baseline (training from scratch, no ImageNet pretrained weights are used):

VGG16: 64.60% on the CUB_200_2011 dataset, lr=0.01, batch size=64
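
The baseline settings above are the starting point for the linear scaling learning rate trick: when the batch size grows by a factor k, the learning rate is multiplied by k as well. A minimal sketch of that arithmetic follows; the function name `scaled_lr` is hypothetical and only serves this example.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Hypothetical helper applying the linear scaling rule.

    The learning rate grows in proportion to the batch size,
    relative to a reference (base_lr, base_batch_size) pair.
    """
    return base_lr * batch_size / base_batch_size


if __name__ == "__main__":
    # Baseline from this README: lr=0.01 at batch size 64.
    # Scaling to batch size 256 gives roughly 0.04.
    print(scaled_lr(base_lr=0.01, base_batch_size=64, batch_size=256))
```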

effects of stacking tricks

trick	acc
baseline	64.60%
+xavier init and warmup training	66.07%
+no bias decay	70.14%
+label smoothing	71.20%
+random erasing	does not work, drops about 4 points
+linear scaling learning rate (batch size 256, lr 0.04)	71.21%
+cutout	does not work, drops about 1 point
+cosine learning rate decay	does not work, drops about 1 point
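
To make the label smoothing row above more concrete, here is a minimal pure-Python sketch of how smoothed targets and the matching cross-entropy loss could be computed for one example. It follows the formulation in Rethinking the Inception Architecture for Computer Vision, where the smoothing mass epsilon is spread uniformly over all classes. The function names and the value epsilon=0.1 are assumptions for illustration, not the settings used in these experiments.

```python
import math


def smoothed_targets(label, num_classes, epsilon=0.1):
    """Hypothetical helper: soft target distribution for one example.

    Every class receives epsilon / num_classes, and the true class
    additionally receives 1 - epsilon, so the targets sum to 1.
    """
    targets = [epsilon / num_classes] * num_classes
    targets[label] += 1.0 - epsilon
    return targets


def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a soft target vector."""
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    # log_softmax(z) = z - log_sum
    return -sum(t * (z - log_sum) for z, t in zip(logits, targets))


if __name__ == "__main__":
    # 200 classes, as in CUB_200_2011; the logits here are made up.
    logits = [0.0] * 200
    logits[3] = 5.0
    print(cross_entropy(logits, smoothed_targets(3, 200)))
```

With epsilon=0 the targets reduce to the usual one-hot vector, so the same loss function covers both the baseline and the smoothed case.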