# Deep Learning based Image Classifier

## Dataset Overview
The dataset used is a subset of the ImageNet dataset. It contains images for three classes: dog, cat, and human. The number of images per class is as follows:
- Class-1 (Dog): 894
- Class-2 (Cat): 1132
- Class-3 (Human): 874
The total number of images is 2,900.
To increase the size of the dataset, I have used the following data augmentation techniques (a code sketch follows the updated image counts below):
- Flip the image about the vertical axis.
- Add Gaussian noise.
- Increase the contrast of the image.
- Crop the image by 10% from all 4 edges.
- Rotate the image by 20 degrees.
Applying all five augmentations to every image (and keeping the originals) gives six times as many images per class:
- Class-1 (Dog): 5364
- Class-2 (Cat): 6792
- Class-3 (Human): 5244
The total number of images is 17,400.
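The report does not tie these augmentations to any particular library; as a rough illustration, the five transforms could be implemented with Pillow and NumPy as sketched below (the noise level and contrast factor are arbitrary illustrative values).

```python
# A minimal sketch of the five augmentations listed above. The tooling (Pillow +
# NumPy), noise std-dev, and contrast factor are assumptions, not taken from the report.
import numpy as np
from PIL import Image, ImageEnhance, ImageOps

def augment(img: Image.Image) -> list:
    """Return the five augmented variants of one image."""
    w, h = img.size
    variants = []
    # 1. Flip about the vertical axis (left-right mirror).
    variants.append(ImageOps.mirror(img))
    # 2. Add Gaussian noise (std-dev of 10 is an illustrative choice).
    arr = np.asarray(img).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0.0, 10.0, arr.shape), 0, 255)
    variants.append(Image.fromarray(noisy.astype(np.uint8)))
    # 3. Increase contrast (enhancement factor > 1 boosts contrast).
    variants.append(ImageEnhance.Contrast(img).enhance(1.5))
    # 4. Crop 10% from all four edges, then resize back to the original size.
    box = (int(0.1 * w), int(0.1 * h), int(0.9 * w), int(0.9 * h))
    variants.append(img.crop(box).resize((w, h)))
    # 5. Rotate by 20 degrees.
    variants.append(img.rotate(20))
    return variants
```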
The number of images required to build a robust classifier is a function of the complexity of the model we are trying to train. The more complex the model (i.e., the more parameters it has to learn), the more data it requires. For example, a simple classifier such as an SVM will need less data to perform reasonably well than a multi-layered CNN.
## Models
I have experimented with the following models:
- Basic CNN: This model consists of the following layers (see the code sketch after this list):
  - Layer 1: A convolutional layer with 32 filters of size 5 x 5.
  - Layer 2: A max-pooling layer with a downscale factor of 2.
  - Layer 3: A convolutional layer with 64 filters of size 5 x 5.
  - Layer 4: A max-pooling layer with a downscale factor of 2.
  - Layer 5: A dense (fully connected) layer with 1024 hidden units.
  - Layer 6: A softmax output layer with 3 nodes (one per class).
- Wide Residual Network: The exact details of this architecture are described in the paper (Link).
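As a point of reference, the basic CNN above can be written in a few lines with tf.keras. The framework, activation functions, padding, and the placement of dropout are my assumptions; the report does not specify them.

```python
# Illustrative tf.keras version of the basic CNN described above; activations,
# padding, and dropout placement are assumptions, not taken from the report.
import tensorflow as tf

def build_basic_cnn(num_classes: int = 3) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(32, kernel_size=5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2),   # downscale factor of 2
        tf.keras.layers.Conv2D(64, kernel_size=5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2),   # downscale factor of 2
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dropout(0.5),                # active during training only
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```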
## Results
Image size used: 32 x 32 x 3

- Model: Basic CNN
  - Batch size: 100
  - Learning rate: 0.01
  - Iterations: 20,000
  - Dropout probability: 0.5 (training), 1.0 (testing)
  - Test accuracy: 80%

- Model: Wide Residual Network

S. No. | Network width (k) | Units per block (n) | Learning rate | Batch size | Iterations | Dropout prob. | Test accuracy (%) |
---|---|---|---|---|---|---|---|
1 | 1 | 2 | 0.01 | 100 | 50,000 | 0.5 | 90 |
2 | 1 | 3 | 0.01 | 100 | 50,000 | 0.5 | 93 |
3 | 2 | 2 | 0.01 | 100 | 50,000 | 0.5 | 94 |
4 | 2 | 3 | 0.01 | 100 | 50,000 | 0.5 | 93.8 |
5 | 2 | 2 | 0.01 | 100 | 50,000 | 0.3 | 94.2 |
6 | 2 | 2 | 0.01 | 200 | 50,000 | 0.3 | 94.8 |
7 | 2 | 3 | 0.01 | 200 | 50,000 | 0.3 | 95.36 |
8 | 3 | 3 | 0.01 | 200 | 100,000 | 0.3 | 95.90 |
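For context on the table: k is the widening factor applied to the filter counts of the residual units, and n is the number of residual units stacked per group, following the WRN paper. A rough tf.keras sketch of one wide residual group is below; the pre-activation ordering and all other details are assumptions, not the exact code behind these results.

```python
# Hypothetical sketch of a wide residual group. "k" widens the filter counts and
# "n" sets how many residual units the group stacks; this is illustrative only.
import tensorflow as tf

def residual_unit(x, filters, stride=1):
    """Pre-activation residual unit: BN-ReLU-Conv(3x3) twice, plus a shortcut."""
    shortcut = x
    y = tf.keras.layers.BatchNormalization()(x)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, strides=stride, padding="same")(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    # Project the shortcut when the spatial size or channel count changes.
    if stride != 1 or x.shape[-1] != filters:
        shortcut = tf.keras.layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    return tf.keras.layers.Add()([y, shortcut])

def wide_group(x, base_width, k, n, stride):
    """A group of n residual units, each base_width * k filters wide."""
    x = residual_unit(x, base_width * k, stride)
    for _ in range(n - 1):
        x = residual_unit(x, base_width * k)
    return x
```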
## Further Improvements
We can improve the performance further by using transfer learning in the following two ways:
- Cut off the final layer of a model pre-trained on similar classes, attach a new softmax layer for our 3 classes, and train this new model on our data. This is called fine-tuning (see the sketch after this list).
- Pass the images through a pre-trained model and extract features from the second-to-last layer, then use these features with another classifier such as an SVM or XGBoost (gradient boosting with decision trees). In the case of XGBoost, we would train 3 one-vs-all classifiers and take a weighted combination of their outputs at test time.
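Below is a compact sketch of both ideas using tf.keras and scikit-learn. The MobileNetV2 backbone, the 96 x 96 input size, and the SVM settings are illustrative assumptions rather than choices made in this report.

```python
# Hypothetical transfer-learning sketch; MobileNetV2, the 96x96 input size, and
# the SVM are illustrative assumptions, not choices from the report.
import tensorflow as tf
from sklearn.svm import SVC

# A backbone pre-trained on ImageNet, with its final classification layer cut off
# and global average pooling producing one feature vector per image.
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, pooling="avg", weights="imagenet")

# (a) Fine-tuning: attach a new 3-class softmax head and continue training on our data.
model = tf.keras.Sequential([base, tf.keras.layers.Dense(3, activation="softmax")])
base.trainable = False  # optionally freeze the pre-trained weights at first
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5, validation_split=0.1)

# (b) Feature extraction: use the penultimate-layer activations as inputs to an SVM.
# features = base.predict(train_images)               # shape: (num_images, 1280)
# clf = SVC(kernel="rbf").fit(features, train_labels)
# predictions = clf.predict(base.predict(test_images))
```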
With both of the above methods we can train a high-performing classifier using relatively little data, and training is much faster as well.
We can also experiment with ensemble networks to see if the performance can be improved.