This repository is an attempt at the Computer Vision Challenge by Grab.
The model is built with fast.ai v1 and PyTorch v1 and trained on a Google Cloud Platform Deep Learning VM with a 16 GB NVIDIA Tesla T4.
The data consists of 8,144 training images (80:20 train:validation split) and 8,041 test images. The architecture is ResNet-152 pretrained on ImageNet, with square 299x299 input images. The data is augmented with several affine and perspective transformations, and the mixup technique is used. The final Top-1 accuracy is 92.53% on the test images.
`Stanford Car Model Training.ipynb` is the notebook used to perform model training and evaluation.
All models are evaluated with Top-1 accuracy on the provided test set.
The stopping criterion for all models is no improvement in validation loss across two cycles of training, where one cycle means training for any number of epochs with the One Cycle Policy.
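As a rough illustration of what one cycle of the One Cycle Policy does to the learning rate (warm up to a peak, then anneal far below the starting rate), here is a stdlib-only sketch; the cosine interpolation and the `div`/`div_final`/`pct_start` defaults are illustrative assumptions, not fast.ai's exact implementation:

```python
import math

def one_cycle_lr(step, total_steps, lr_max, div=25.0, div_final=1e4, pct_start=0.3):
    """Learning rate at `step` under a one-cycle schedule:
    cosine warm-up from lr_max/div up to lr_max, then cosine
    annealing down to lr_max/div_final."""
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        pct = step / max(1, warmup_steps)
        start, end = lr_max / div, lr_max
    else:
        pct = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        start, end = lr_max, lr_max / div_final
    # cosine interpolation between start and end
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))
```

Because the final rate is orders of magnitude below the peak, "two cycles without validation-loss improvement" is a natural stopping point: each cycle ends in a fine-tuning regime.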
- Comparing different image dimensions (square images)
Training Technique | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Baseline - Image Size (224x224) | 87.3 | 88.9 | 89.9 |
Baseline - Image Size (299x299) | 88.0 | 90.3 | 90.7 |
A 299x299 image size yields better results. This setting is applied to all further models.
- Comparing Resizing Methods
Training Technique | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Resizing Method - Zero Padding | 86.0 | - | - |
Resizing Method - Crop | 86.6 | - | - |
Resizing Method - Squishing | 88.0 | - | - |
Squishing the image yields better results. This setting is applied to all further models.
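The three resizing strategies differ in what they sacrifice: squishing distorts the aspect ratio, cropping discards edge content, and padding adds artificial borders. The toy nearest-neighbour functions below illustrate the concepts on plain nested lists; they are a simplified sketch, not fast.ai's `ResizeMethod` implementation (crop assumes the image is at least `size` in both dimensions, pad assumes it is at most `size`):

```python
def squish(img, size):
    """Rescale both axes independently to size x size
    (distorts aspect ratio, keeps all content)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def center_crop(img, size):
    """Cut the central size x size window
    (keeps aspect ratio, discards content near the edges)."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]

def zero_pad(img, size):
    """Place the image on a size x size canvas of zeros
    (keeps aspect ratio, introduces artificial borders)."""
    h, w = len(img), len(img[0])
    out = [[0] * size for _ in range(size)]
    top, left = (size - h) // 2, (size - w) // 2
    for r in range(h):
        for c in range(w):
            out[top + r][left + c] = img[r][c]
    return out
```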
- Using the training set cropped to the provided bounding boxes
Training Technique | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Without Bounding Box | 88.0 | 90.3 | 90.7 |
With Bounding Box | 70.3 | 71.7 | 71.9 |
The training set without bounding boxes yields better results. This setting is applied to all further models.
- Using Mix Up on training data
Training Technique | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Without Mix Up | 88.0 | 90.3 | 90.7 |
With Mix Up | 89.3 | 90.9 | 92.53 |
Training was done on a Google Cloud Platform Deep Learning VM with a 16 GB NVIDIA Tesla T4 GPU and a batch size of 16.
 | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Training Time per epoch | 3:30 minutes | 4:10 minutes | 5:40 minutes |
- I chose ResNet as the model architecture because it has achieved state-of-the-art results on many fine-grained image classification problems since 2015. Recent breakthroughs in fine-grained image classification, such as arXiv:1901.09891v2 and arXiv:1712.01034v2, which suggest modifications to data augmentation and normalization layers, were built on top of ResNet to obtain the best results.
- ResNet-152 provides the best accuracy (a 2-3% increase over ResNet-50) at the expense of increased training time (about 2 minutes/epoch more).
- Several transfer learning steps are used to achieve the best-performing model (in order):
  - Transfer learning from a model pretrained on ImageNet to the mixup-augmented Stanford Cars dataset.
  - Transfer learning from the model trained on the mixup-augmented Stanford Cars dataset to the vanilla Stanford Cars dataset.
- Training data are augmented with several transformations to increase the variety of the dataset, which helps the model generalize better. Details of the data augmentation are explained in the `Stanford Car Model Training.ipynb` notebook.
- Higher-resolution images train a better model, but at the expense of training time. Due to time constraints, I was not able to train with resolutions higher than 299x299.
- Training with images squished to the target resolution produces a better model. Automatic cropping risks deleting important features that fall outside the cropping boundary, and padding introduces artefacts that lower training accuracy. A squished image preserves most features, except in the scenario where the make/model of a car is mostly determined by its width:height (aspect) ratio.
- Instead of using square images, I experimented with resizing the dataset to rectangular images with 16:9 and 4:3 aspect ratios, aiming to preserve features determined by the aspect ratio of a car. This showed a slight increase in accuracy (0.3%). However, this is only achievable because the provided dataset is mostly in landscape orientation.
- Considering most Grab users are on mobile, images are usually taken in portrait orientation. Resizing a portrait image to landscape severely distorts the features of a car, so I decided not to select a "rectangular" model as the final model.
- Training with images cropped to the provided bounding boxes produces significantly worse results. The resulting model could not distinguish background noise from the car in the foreground well enough on the test dataset.
- Augmenting the data with mixup yields a 2-3% increase in accuracy.
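The mixup gain above comes from a simple mechanism: sample a mixing coefficient from a Beta distribution and blend both the inputs and the one-hot labels of two examples with that same coefficient. A minimal stdlib sketch, with `alpha=0.4` chosen only as a typical illustrative value:

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.4):
    """Blend two training examples with a shared coefficient
    lam ~ Beta(alpha, alpha); inputs and one-hot labels are
    mixed identically, so the target stays a valid distribution."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

In fast.ai v1 this is handled by a callback rather than explicit data mixing, but the blending above is the core idea.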
- A Linux-based operating system (fast.ai does not support macOS in its current build)
- A virtual environment such as `conda` or `virtualenv`
- 10 GB of free disk space (to be safe); PyTorch, fast.ai, and their dependencies take up a good amount of disk space.
- (Optional) Git Large File Storage, used for hosting the model files (they are huge).
- (Optional) A GPU in the machine. This speeds up prediction by a huge margin if you are running inference on a large dataset.
Before cloning the repository, run:

```shell
git lfs install
```

to initialize Git LFS. Then, clone the repository as usual.
OR
If you cloned the repository before initializing, run:

```shell
git lfs install
git lfs pull
```

in the repository directory to download the model file.
Alternatively, download `best-model.pkl` manually from GitHub and replace the file in your local repository.
Set up a Python >= 3.6.0 virtual environment with `conda` or `virtualenv`, then install the dependencies with pip:

```shell
pip install -r requirements.txt
```
- Activate the virtual environment
- Create a fresh directory and place all the test images in it. (Make sure there is nothing other than images in the folder.)
- Run

```shell
python predict.py generate_csv_for_test_data --img_path=<your_test_folder_path> --output_fpath=<output_file_path>
```

in a terminal. Example:

```shell
python predict.py generate_csv_for_test_data --img_path=test_images --output_fpath=test.csv
```

This will output a csv file with a prediction and probability for each image.
- Create a fresh directory and place all the test images in it. (Make sure there is nothing other than images in the folder.)
- Create a csv file with two columns: `fname` for the image filenames and `label` for the image labels.
fname | label |
---|---|
00001.jpg | Suzuki Aerio Sedan 2007 |
00002.jpg | Ferrari 458 Italia Convertible 2012 |
00003.jpg | Jeep Patriot SUV 2012 |
00004.jpg | Toyota Camry Sedan 2012 |
00005.jpg | Tesla Model S Sedan 2012 |
IMPORTANT: `fname` in the csv file should exactly match the filenames of the images in the folder. (Filename only, not the path.)
- Run

```shell
python predict.py populate_csv_for_labelled_data --csv_path=<your_csv_path> --img_path=<your_test_folder_path> --output_fpath=<output_file_path>
```

in a terminal. Example:

```shell
python predict.py populate_csv_for_labelled_data --csv_path=data_with_labels.csv --img_path=test_images --output_fpath=labelled.csv
```

This will populate the csv file with a prediction and probability for each image. It will also print performance metrics in the terminal: accuracy, recall, precision, and F1-score.
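For reference, metrics of this kind can be computed from (label, prediction) pairs roughly as below. This stdlib sketch uses macro averaging and derives F1 from the macro precision/recall, which is one common convention; it is not necessarily the exact formula `predict.py` uses.

```python
from collections import defaultdict

def macro_metrics(rows):
    """Accuracy plus macro-averaged precision, recall, and F1
    from an iterable of (true_label, predicted_label) pairs."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    correct = 0
    for label, pred in rows:
        if pred == label:
            correct += 1
            tp[label] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[label] += 1  # true class gets a false negative
    classes = set(tp) | set(fp) | set(fn)
    prec = sum(tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
               for c in classes) / len(classes)
    rec = sum(tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
              for c in classes) / len(classes)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return correct / len(rows), prec, rec, f1
```

Feeding in the `label` and prediction columns of the output csv as tuples reproduces the general shape of the reported metrics.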