This repo was inspired by the paper "Bag of Tricks for Image Classification with Convolutional Neural Networks".
I will test as many popular training tricks as I can for improving image classification accuracy. Feel free to leave a comment about the tricks you want me to test (please cite the paper each trick comes from).
The experiments run on 4 Tesla P40 GPUs.
I use the CUB_200_2011 dataset instead of ImageNet, just for simplicity. It is a fine-grained image classification dataset with 200 bird categories, 5K+ training images, and 5K+ test images. The state-of-the-art accuracy with VGG16 is around 73% (please correct me if I am wrong). You can easily switch to another dataset you like, such as Stanford Dogs or Stanford Cars, or even ImageNet.
I use a VGG16 network to test the tricks, also for simplicity, since VGG16 is easy to implement. I'm considering switching to AlexNet to see how powerful these tricks are.
Tricks I've tested so far; some of them are from the paper "Bag of Tricks for Image Classification with Convolutional Neural Networks":
trick | referenced paper |
---|---|
xavier init | Understanding the difficulty of training deep feedforward neural networks |
warmup training | Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour |
no bias decay | Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes |
label smoothing | Rethinking the Inception Architecture for Computer Vision |
random erasing | Random Erasing Data Augmentation |
cutout | Improved Regularization of Convolutional Neural Networks with Cutout |
linear scaling learning rate | Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour |
cosine learning rate decay | SGDR: Stochastic Gradient Descent with Warm Restarts |
More to come.
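As an illustration of one trick from the table, here is a minimal sketch of label smoothing in plain Python. It follows the formulation in the Inception paper cited above, where the true class gets `1 - eps + eps/K` and every other class gets `eps/K`. The function names and the default `eps=0.1` are assumptions made for this example, not necessarily what this repo uses.

```python
import math


def smooth_labels(target, num_classes, eps=0.1):
    """Return a smoothed target distribution for one sample.

    eps=0.1 is an assumed default for illustration only.
    """
    off_value = eps / num_classes
    dist = [off_value] * num_classes
    dist[target] = 1.0 - eps + off_value
    return dist


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy between the smoothed target and the softmax output."""
    log_probs = log_softmax(logits)
    q = smooth_labels(target, len(logits), eps)
    return -sum(qi * lp for qi, lp in zip(q, log_probs))
```

With `eps=0` this reduces to the usual cross-entropy loss, so the smoothing strength can be checked or turned off with a single parameter.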
Baseline (training from scratch, no ImageNet pretrained weights are used):
VGG16: 64.60% on the CUB_200_2011 dataset, lr=0.01, batch size 64
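The baseline settings double as the reference point for the linear scaling rule listed in the tricks table: the learning rate grows in proportion to the batch size. A minimal sketch, where the function name `scaled_lr` is hypothetical and the defaults are the baseline values above:

```python
def scaled_lr(batch_size, base_lr=0.01, base_batch_size=64):
    """Linear scaling rule: scale the learning rate with the batch size.

    The defaults are the baseline settings (lr=0.01 at batch size 64).
    """
    return base_lr * batch_size / base_batch_size


# Batch size 256 is four times the baseline, so the learning rate is
# four times 0.01, i.e. about 0.04.
print(scaled_lr(256))
```

This reproduces the batch size 256, lr 0.04 pairing used in the stacked results.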
Effects of stacking tricks:
trick | acc |
---|---|
baseline | 64.60% |
+xavier init and warmup training | 66.07% |
+no bias decay | 70.14% |
+label smoothing | 71.20% |
+random erasing | does not work, drops about 4 points |
+linear scaling learning rate (batch size 256, lr 0.04) | 71.21% |
+cutout | does not work, drops about 1 point |
+cosine learning rate decay | does not work, drops about 1 point |
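
For reference, warmup and cosine decay can be combined into one per-epoch schedule. The sketch below is only an illustration of the two ideas: the function name, the 5-epoch warmup, and the epoch-level granularity are assumptions, not this repo's actual settings.

```python
import math


def lr_at_epoch(epoch, total_epochs, peak_lr, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch.

    Linear warmup to peak_lr over the first warmup_epochs (assumed to be 5),
    then cosine decay from peak_lr towards 0 over the remaining epochs.
    """
    if epoch < warmup_epochs:
        # Ramp up linearly; the last warmup epoch reaches peak_lr.
        return peak_lr * (epoch + 1) / warmup_epochs
    decay_epochs = max(1, total_epochs - warmup_epochs)
    progress = (epoch - warmup_epochs) / decay_epochs
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Example: inspect the first few epochs of a hypothetical 100-epoch run.
for epoch in range(8):
    print(epoch, lr_at_epoch(epoch, total_epochs=100, peak_lr=0.04))
```

Passing `peak_lr=0.04` matches the learning rate used with batch size 256 in the table above. Removing the cosine branch and returning `peak_lr` after warmup gives a warmup-only schedule.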