This repository is an attempt at the Computer Vision Challenge by Grab.
The model is built with fast.ai v1 and PyTorch v1 and trained on a Google Cloud Platform Deep Learning VM with a 16 GB NVIDIA Tesla T4.
The data consists of 8,144 training images (80:20 train:validation split) and 8,041 test images. The architecture is ResNet-152 pretrained on ImageNet, with square 299x299 input images. The data is augmented with several affine and perspective transformations, and the mixup technique is used. The final Top-1 accuracy is 92.53% on the test images.
`Stanford Car Model Training.ipynb` is the notebook used to perform model training and evaluation.
All models are evaluated with Top-1 accuracy on the provided test set.
The stopping criterion for all models is no improvement in validation loss across two cycles of training, where one cycle means training for any number of epochs with the One Cycle Policy.
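As a rough illustration of what one cycle of the One Cycle Policy does to the learning rate (warm up to a peak, then anneal far below the starting rate), here is a stdlib-only sketch; the cosine interpolation and the `div`/`div_final`/`pct_start` defaults are illustrative assumptions, not fast.ai's exact implementation:

```python
import math

def one_cycle_lr(step, total_steps, lr_max, div=25.0, div_final=1e4, pct_start=0.3):
    """Learning rate at `step` under a one-cycle schedule:
    cosine warm-up from lr_max/div up to lr_max, then cosine
    annealing down to lr_max/div_final."""
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        pct = step / max(1, warmup_steps)
        start, end = lr_max / div, lr_max
    else:
        pct = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        start, end = lr_max, lr_max / div_final
    # cosine interpolation between start and end
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))
```

Because the final rate is orders of magnitude below the peak, "two cycles without validation-loss improvement" is a natural stopping point: each cycle ends in a fine-tuning regime.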
- Comparing different image dimensions (square images)
Training Technique | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Baseline - Image Size (224x224) | 87.3 | 88.9 | 89.9 |
Baseline - Image Size (299x299) | 88.0 | 90.3 | 90.7 |
A 299x299 image size yields better results. This setting is applied to all further models.
- Comparing Resizing Methods
Training Technique | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Resizing Method - Zero Padding | 86.0 | - | - |
Resizing Method - Crop | 86.6 | - | - |
Resizing Method - Squishing | 88.0 | - | - |
Squishing the image yields better results. This setting is applied to all further models.
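The three resizing strategies differ in what they sacrifice: squishing distorts the aspect ratio, cropping discards edge content, and padding adds artificial borders. The toy nearest-neighbour functions below illustrate the concepts on plain nested lists; they are a simplified sketch, not fast.ai's `ResizeMethod` implementation (crop assumes the image is at least `size` in both dimensions, pad assumes it is at most `size`):

```python
def squish(img, size):
    """Rescale both axes independently to size x size
    (distorts aspect ratio, keeps all content)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def center_crop(img, size):
    """Cut the central size x size window
    (keeps aspect ratio, discards content near the edges)."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]

def zero_pad(img, size):
    """Place the image on a size x size canvas of zeros
    (keeps aspect ratio, introduces artificial borders)."""
    h, w = len(img), len(img[0])
    out = [[0] * size for _ in range(size)]
    top, left = (size - h) // 2, (size - w) // 2
    for r in range(h):
        for c in range(w):
            out[top + r][left + c] = img[r][c]
    return out
```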
- Using the training set cropped to the provided bounding boxes
Training Technique | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Without Bounding Box | 88.0 | 90.3 | 90.7 |
With Bounding Box | 70.3 | 71.7 | 71.9 |
The training set without bounding boxes yields better results. This setting is applied to all further models.
- Using Mix Up on training data
Training Technique | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Without Mix Up | 88.0 | 90.3 | 90.7 |
With Mix Up | 89.3 | 90.9 | 92.53 |
Training was done on a Google Cloud Platform Deep Learning VM with a 16 GB NVIDIA Tesla T4 GPU and a batch size of 16.
 | Resnet 50 | Resnet 101 | Resnet 152 |
---|---|---|---|
Training Time per epoch | 3:30 minutes | 4:10 minutes | 5:40 minutes |
- I chose ResNet as the model architecture because it has achieved state-of-the-art results on many fine-grained image classification problems since 2015. Recent breakthroughs in fine-grained image classification, such as arXiv:1901.09891v2 and arXiv:1712.01034v2, which suggest modifications to data augmentation and normalization layers, were built on top of ResNet to obtain the best results.
- ResNet-152 provides the best accuracy (a 2-3% increase over ResNet-50) at the expense of increased training time (about 2 minutes/epoch more).
- Several transfer learning steps are used to achieve the best-performing model (in order):
  - Transfer learning from a model pretrained on ImageNet to the mixup-augmented Stanford Cars dataset.
  - Transfer learning from the model trained on the mixup-augmented Stanford Cars dataset to the vanilla Stanford Cars dataset.
- Training data are augmented with several transformations to increase the variety of the dataset, which helps the model generalize better. Details of the data augmentation are explained in the `Stanford Car Model Training.ipynb` notebook.
- Higher-resolution images train a better model, but at the expense of training time. Due to time constraints, I was not able to train with resolutions higher than 299x299.
- Training with images squished to the target resolution produces a better model. Automatic cropping risks deleting important features that fall outside the cropping boundary, and padding introduces artefacts that lower training accuracy. A squished image preserves most features, except in the scenario where the make/model of a car is mostly determined by its width:height (aspect) ratio.
- Instead of using square images, I experimented with resizing the dataset to rectangular images with 16:9 and 4:3 aspect ratios, aiming to preserve features determined by the aspect ratio of a car. This showed a slight increase in accuracy (0.3%). However, this is only achievable because the provided dataset is mostly in landscape orientation.
- Considering most Grab users are on mobile, images are usually taken in portrait orientation. Resizing a portrait image to landscape severely distorts the features of a car, so I decided not to select a "rectangular" model as the final model.
- Training with images cropped to the provided bounding boxes produces significantly worse results. The resulting model could not distinguish background noise from the car in the foreground well enough on the test dataset.
- Augmenting the data with mixup yields a 2-3% increase in accuracy.
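The mixup gain above comes from a simple mechanism: sample a mixing coefficient from a Beta distribution and blend both the inputs and the one-hot labels of two examples with that same coefficient. A minimal stdlib sketch, with `alpha=0.4` chosen only as a typical illustrative value:

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.4):
    """Blend two training examples with a shared coefficient
    lam ~ Beta(alpha, alpha); inputs and one-hot labels are
    mixed identically, so the target stays a valid distribution."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

In fast.ai v1 this is handled by a callback rather than explicit data mixing, but the blending above is the core idea.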
- A Linux-based operating system (fast.ai does not support macOS in its current build)
- A virtual environment such as `conda` or `virtualenv`
- 10 GB of free disk space (to be safe); PyTorch, fast.ai, and their dependencies take up a good amount of disk space.
- (Optional) Git Large File Storage, used for hosting the model files (they are huge).
- (Optional) A GPU in the machine. This speeds up prediction by a huge margin if you are running inference on a large dataset.
Before cloning the repository, run:

```shell
git lfs install
```

to initialize Git LFS. Then, clone the repository as usual.
OR
If you cloned the repository before initializing, run:

```shell
git lfs install
git lfs pull
```

in the repository directory to download the model file.
Alternatively, download `best-model.pkl` manually from GitHub and replace the file in your local repository.
Set up a Python >= 3.6.0 virtual environment with `conda` or `virtualenv`, then install the dependencies with pip:

```shell
pip install -r requirements.txt
```
- Activate the virtual environment
- Create a fresh directory and place all the test images in it. (Make sure there is nothing other than images in the folder.)
- Run

```shell
python predict.py generate_csv_for_test_data --img_path=<your_test_folder_path> --output_fpath=<output_file_path>
```

in a terminal. Example:

```shell
python predict.py generate_csv_for_test_data --img_path=test_images --output_fpath=test.csv
```

This will output a csv file with a prediction and probability for each image.
- Create a fresh directory and place all the test images in it. (Make sure there is nothing other than images in the folder.)
- Create a csv file with two columns: `fname` for the image filenames and `label` for the image labels.
fname | label |
---|---|
00001.jpg | Suzuki Aerio Sedan 2007 |
00002.jpg | Ferrari 458 Italia Convertible 2012 |
00003.jpg | Jeep Patriot SUV 2012 |
00004.jpg | Toyota Camry Sedan 2012 |
00005.jpg | Tesla Model S Sedan 2012 |
IMPORTANT: `fname` in the csv file should exactly match the filenames of the images in the folder. (Filename only, not the path.)
- Run

```shell
python predict.py populate_csv_for_labelled_data --csv_path=<your_csv_path> --img_path=<your_test_folder_path> --output_fpath=<output_file_path>
```

in a terminal. Example:

```shell
python predict.py populate_csv_for_labelled_data --csv_path=data_with_labels.csv --img_path=test_images --output_fpath=labelled.csv
```

This will populate the csv file with a prediction and probability for each image. It will also print performance metrics in the terminal: accuracy, recall, precision, and F1-score.
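For reference, metrics of this kind can be computed from (label, prediction) pairs roughly as below. This stdlib sketch uses macro averaging and derives F1 from the macro precision/recall, which is one common convention; it is not necessarily the exact formula `predict.py` uses.

```python
from collections import defaultdict

def macro_metrics(rows):
    """Accuracy plus macro-averaged precision, recall, and F1
    from an iterable of (true_label, predicted_label) pairs."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    correct = 0
    for label, pred in rows:
        if pred == label:
            correct += 1
            tp[label] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[label] += 1  # true class gets a false negative
    classes = set(tp) | set(fp) | set(fn)
    prec = sum(tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
               for c in classes) / len(classes)
    rec = sum(tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
              for c in classes) / len(classes)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return correct / len(rows), prec, rec, f1
```

Feeding in the `label` and prediction columns of the output csv as tuples reproduces the general shape of the reported metrics.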