Project: Build a Traffic Sign Recognition Program

Overview

This project is part of Self-Driving Car Nano Degree program at Udacity. The main objective of the project is to use deep learning to classify traffic sign images from the German Traffic Signs Data set. Training, validation and test is done using the dataset availalbe for the project. Furthermore, after the best performing model has been chosen, it is used to classify images downloaded from the Internet. The notebooks and other resources available in this repo has been forked from the project repo.

The basic structure of the project is as follows. This report summarizes the results.

Loading the data set
Exploration, summary and visualization of the data set
Design, train and validate a model architecture
Use the model to make predictions on new images
Analyze the softmax probabilities of the new images
Remarks
Performance on Test Dataset

NOTE: All the dependencies and lab environment required for reproducing the results presented in this report can be created with CarND Term1 Starter Kit.

Folder Contents

Here is an overview of what this repo contains.

├── Main.ipynb: the main notebook
├── Main.html: HTML export form Main.ipynb
├── data/: folder for storing data sets, models and run logs.
├── docs/: additional document resources (eg. images generated from the notebook)
├── main.py: main logic from the notebook as a python file for easy testing on the server
└── src/: contains the core python code for supporting the notebook

My notebook is based on the template notebook for the project available here. The images and tables shown in this documentation have been generated from the notebook and just embedded or copied to this document.

Loading the Dataset

The images were provided in pickled format as part of the project. The files can be downloaded from here.

The basic structure of the pickled files (there are three of them) is as follows.

features is a 4D array containing raw pixel data of the traffic sign images, (num examples, width, height, channels).
labels is a 1D array containing the label/class id of the traffic sign. The file signnames.csv contains id -> name mappings for each id.
sizes is a list containing tuples, (width, height) representing the original width and height the image.
coords is a list containing tuples, (x1, y1, x2, y2) representing coordinates of a bounding box around the sign in the image. These coordinates assume the original image the pickled data contains resized versions(32 by 32 )of these images.

Exploring the Dataset

Here's a result of exploring training and validation data sets. Test data set was not explored except for basic summary statistic to avoid influencing model selection.

Dataset	No. of examples	No. of classes	Image shape
Training	34799	43	(32, 32, 3)
Validation	4410	43	(32, 32, 3)
Test	12630	43	(32, 32, 3)

The signal names can be found in data/signnames.csv. There is a utility method in util.py to convert a given label to sign name.

The following figure shows 25 random images from the training data set.

As seen in the distribution below, the training and validation examples are not uniformly distributed per label. Nevertheless, the imbalance is not so stark that it has few examples for 1 or 2 classes.

The following graphic shows the ratio of the distribution of training examples and validation examples by labels. In general, the training set seems to be representative of the validation set.

Model Architecture

My model is based on the LeNet architecture from Yann Lecun. The basic overview of the original architecture is shown in figure below. Main building blocks of the model are a series of convolutional layers followed by a series of fully connected layers to return softmax probabilities on the predicted labels. The convolutional layers are followed by ReLU activation layers and max-pooling layers.

Performance of the original architecture on the training dataset was already satisfactory. However, the performance on the validation dataset was below par (minimum of 93% was required for the project). Tweaking learning rates or the number of epochs did not help. This was expected and the model showed signs of over-fitting. Drop-out layers were introduced after each activation layer to improve performance on the validation data set.

To avoid dropping out during prediction, my model accepts an additional parameter called keep_probability that uses either the value from hyper-parameter dictionary from lenet.py during training, or the value is set to 1 while predicting.

To enrich the training data set, multiple data augmentation techniques were tried out. Augmenting the training set with horizontally flipped, randomly scaled and rotated, gaussian blurred copies of the provided examples resulted in generally higher validation accuracy. All augmenters can be found in augmentation.py. For other versions of augmenters that were tried out, refer to the git history of the file. The final set of augmenters chosen were as follows.

1x Gaussian Blur augmenter with kernel size of 3x3
5x Affine transformation augmenter with random scaling and random rotations in the range of [0.9, 1.1] and [-15, 15] degrees respectively as described in Sermanet et. al. (reference below). The center for rotation was picked with [-2, 2] pixels offset from the image center.

Together, they result in the training data for this project to be 6 times the original size.

Pre-processing

The images were converted to grayscale and Z-normalization was performed on the training data. The pre-processors can be found in preprocessing.py. The mean and sigma were derived from the augmented training examples and the values were just used for pre-processing validation and test data sets.

The channel weights for the RGB channels were [0.299, 0.587, 0.114] as suggested in this blog post.

From one of the runs, the Z-normalization values were as follows. The actual value differed based on the pixel values of after the augmentation, but the values were almost equal to the following up to one decimal place. The actual values from each run can be found in the run log.

Mean	Sigma
77.3	65.1

The following shows two instances of training images before and after applying pre-processing. For both cases, the overall contrast seems to be preserved in the grayscale image and in the Keep Right case (the bottom image), the contrast is slightly better.

Model

The final architecture was as follows. The source code is available at LeNet function in lenet.py..

Layer	Input	Output	Details
Input	32x32xC	-	The model supports any number of channels
Layer 1 Convolutional	32x32xC	28x28x6	Valid padding with ReLU activation
Layer 1 Pooling + Dropout	28x28x6	14x14x6	Max pooling and dropout layers
Layer 2 Convolutional + activation	14x14x6	10x10x16	Valid padding with ReLU activation
Layer 2 Pooling + Dropout	10x10x16	5x5x16	Max pooling and dropout layers
Layer 3 Fully connected	400	120	Flattened output from previous layer as input with ReLU activation
Layer 3 Dropout			With 70% keep probability
Layer 4 Fully connected	120	84	With ReLU activation
Layer 4 Dropout			With 70% keep probability
Layer 5 Fully connected	84	43	Outputs logits

As for the hyper-parameters, the starting values were chosen from the hand-writing recognition project from the course. With 70% keep probability for the dropout layers, the model was performing quite well. The starting values for the weights in the network were chosen from a truncated normal distribution with mean (mu) 0 and standard deviation (sigma) 0.1.

Hyper-parameter	Value	Comments
Learning rate	0.001	Experimenting with 0.0009 gave slight but not so significant improvement in some cases. Learning was a bit slower though. So, 0.001 was chosen.
Epochs	50	The validation accuracy reached a plateau already at around 30-40 epochs
Batch size	1024	An epoch of training took around 10s, which was good enough.
Keep probability	70%	Started with 80%, but lower value yielded in higher validation accuracy as expected.
mu	0	Value inherited from MNIST dataset classification project
sigma	0.1	Value inherited from MNIST dataset classification project

Please refer to runs.txt file for detail on how the accuracy was affected by various combinations of hyper-parameters.

Overall, the validation accuracy reached over 97% with the chosen model and the hyper-parameters.

Make prediction on new images

Five images were downloaded from the Internet by searching for "traffic signs" in Google Images Search. I randomly picked images from signs that were in the list in signnames.csv. I did not really select any particular characteristic in the traffic signs that I chose.

The images were cropped to just show the signs, and the canvas size was changed to 32x32 as required by the model. GIMP tool was chosen for this task.

Here are all the images that were downloaded after they have been cropped and resized.

Analyze Predictions on new Images

Actual Traffic Sign	Predicted Sign	Probability	Match
Speed limit (50km/h)	Speed limit (30km/h)	0.50	❌
Speed limit (70km/h)	Speed limit (70km/h)	1.0	✅
Priority road	Priority road	0.999	✅
Right-of-way at the next intersection	Right-of-way at the next intersection	0.999	✅
Turn right ahead	Turn right ahead	1.0	✅

The following graphic shows the soft-max probabilities for the signs. The full list of predictions can be found in the notebook.

Although it might appear that the accuracy for the images downloaded from the Internet is 80%, the model really gets confused with the numbers inside speed limit signs. It got the first sign wrong and the second sign was almost tipping over wrongly to 20km/h instead of 70km/h in some runs. This could have been due to mismatch in quality of the signs in the training dataset and the test image from the Internet. Furthermore, the signs might not even have been European signs.

I specifically looked for traffic signs showing speed limits from European roads and I found an 80km/h speed limit from a French road. Running prediction on that sign yielded better results. To improve the overall performance of the model it would make sense to maybe apply contrast filters to highlight the contents of the signs even more.

The following graphic shows the soft-max probabilities on the 80km/h sign.

Remarks

Due to time constraint, I skipped visualising layers using the outputFeatureMap function. Investigating how the network learnt particularly the traffic signs with speed limits could shed a bit more light on why they are not distinguished with high confidence.

Performance on the Test Dataset

Finally, I ran evaluation on the test dataset. Performance on the test data set with the aforementioned settings and architecture was 95.2%.

References

LeCun, Yann. "LeNet-5, convolutional neural networks." URL: http://yann.lecun.com/exdb/lenet/
Sermanet, Pierre, and Yann LeCun. "Traffic sign recognition with multi-scale Convolutional Networks." IJCNN. 2011. URL: http://yann.lecun.com/exdb/publis/pdf/sermanet-ijcnn-11.pdf

saksham/CarND-Traffic-Sign-Classifier-Project