In this project, I used deep neural networks and convolutional neural networks to classify traffic signs. I trained and validated a model so it can classify traffic sign images using the German Traffic Sign Dataset.
- 1. Dataset Summary
- 2. Design and Test a Model Architecture
- 3. Model Development Progress
- 3.1 Model definition
- 3.2 Model development
- 3.2.1 Baseline training
- 3.2.2 Dropout keepprob experiment
- 3.2.3 L2 regularization test
- 3.2.4 Batch norm experiment
- 3.2.5 Rate decay/SGD with momentum experiment
- 3.2.6 Experiments with constrained softmax loss layer/Monte Carlo simulation
- 3.2.7 Inception layer experiment
- 3.2.8 Data augmentation experiment
- 3.3 Final model
- 3.4 Evaluation on the test set
- 4. Test a Model on New Images
- 5. Visualize the Neural Network's State with Test Images
The base dataset provided by the website is split into 67% training, 9% validation and 24% test data, as summarized below. Because the dataset is large, a relatively small validation set is sufficient. With a smaller dataset, a larger validation share would be needed to detect overfitting reliably.
Number of training examples = 34799
Number of validation examples = 4410
Number of testing examples = 12630
Image data shape = (32, 32, 3)
Number of classes = 43
training set: validation set: testing set = 0.67 : 0.09 : 0.24
Here I plot the number of examples per unique label for each dataset.
This is important because the training, validation and test sets should have similar distributions across the unique labels.
If the distributions differ heavily between these sets, either the training performance or the test performance will suffer.
As shown above, the three original sets have similar label distributions.
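
The distribution check described above can also be done numerically. The following is a minimal sketch, assuming the labels are available as integer NumPy arrays; the names `class_distribution`, `max_distribution_gap` and the toy arrays are illustrative and not taken from the project code.

```python
import numpy as np

def class_distribution(labels, n_classes=43):
    """Return the fraction of examples that fall into each class."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

def max_distribution_gap(labels_a, labels_b, n_classes=43):
    """Largest absolute per-class difference between two label sets."""
    dist_a = class_distribution(labels_a, n_classes)
    dist_b = class_distribution(labels_b, n_classes)
    return float(np.max(np.abs(dist_a - dist_b)))

# Toy labels standing in for the real training and validation labels.
y_train_demo = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_valid_demo = np.array([0, 1, 2, 2])
gap = max_distribution_gap(y_train_demo, y_valid_demo, n_classes=3)
print(gap)
```

A gap close to zero means the two sets share almost the same label proportions; a large gap would point to the mismatch discussed above.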
Below are random examples from each unique label of the traffic sign data. There are 43 different types of traffic signs.
This is my first attempt at image augmentation (not used for the final model).
Here I use an image augmentation library and define a function that randomly augments images.
The augmentation techniques include:
- Flip images left and right
- Crop the images
- Apply Gaussian blur to the images
- Strengthen or weaken the contrast in each image
- Add Gaussian noise
- Make images brighter or darker
- Apply affine transformations to each image (scale/zoom, translate/move, rotate and shear)
```python
import imgaug.augmenters as iaa

seq = iaa.Sequential([
    iaa.Fliplr(0.5),  # horizontal flips
    iaa.Crop(percent=(0, 0.1)),  # random crops
    # Small Gaussian blur with random sigma between 0 and 0.5.
    # But we only blur about 50% of all images.
    iaa.Sometimes(0.5,
        iaa.GaussianBlur(sigma=(0, 0.5))
    ),
    # Strengthen or weaken the contrast in each image.
    # (Newer imgaug releases deprecate this in favor of iaa.LinearContrast.)
    iaa.ContrastNormalization((0.75, 1.5)),
    # Add Gaussian noise.
    # For 50% of all images, we sample the noise once per pixel.
    # For the other 50% of all images, we sample the noise per pixel AND
    # channel. This can change the color (not only brightness) of the
    # pixels.
    iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.05*255), per_channel=0.5),
    # Make some images brighter and some darker.
    # In 20% of all cases, we sample the multiplier once per channel,
    # which can end up changing the color of the images.
    iaa.Multiply((0.8, 1.2), per_channel=0.2),
    # Apply affine transformations to each image.
    # Scale/zoom them, translate/move them, rotate them and shear them.
    iaa.Affine(
        scale={"x": (0.8, 1.2), "y": (0.8, 1.2)},
        translate_percent={"x": (-0.2, 0.2), "y": (-0.2, 0.2)},
        rotate=(-25, 25),
        shear=(-8, 8)
    )
], random_order=True)  # apply augmenters in random order
```
In the first augmentation attempt, I use the function above to randomly augment images.
It applies crops and affine transformations, flips some of the images horizontally, adds a bit of noise and blur, and changes the contrast as well as the brightness.
Through this process I get 10 times more images for my training set, which is a huge addition.
Note: this dataset is not used for the final model. After testing, I found that I had added too much noise to the training set, which degraded model performance.
Examples of images from the first augmentation attempt:
This is my second attempt at image augmentation (this augmented dataset is used for the final model).
From the first attempt I learned that I should not add too much noise to the images. After heavy augmentation, the images may become impossible for either a machine or a human to interpret.
In the second attempt, I use lighter augmentation techniques to get a reasonable training set. This ends up giving better performance in my final model.
Below is the list of augmentations.
- Make images brighter or darker
- Apply affine transformations to each image (scale/zoom, translate/move, rotate and shear)
The original training images are saved to a local file train_aug_adjust.p
The new training set is 5 times larger than the original training set, i.e. 6 times its size (34799 x 6 = 208794).
Here are examples of images from the second augmentation attempt. The augmentation is clearly lighter than in the first attempt.
Through augmentation, now we have:
Number of augmented training examples = 208794
Number of validation examples = 4410
Number of testing examples = 12630
Image data shape = (32, 32, 3)
Number of classes = 43
aug training set: validation set: testing set = 0.92 : 0.02 : 0.06
The resulting dataset ratio is shown above. I now have a much larger training set than the one originally provided. The new training set is stored as train_aug.p
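
The example count above (208794 = 34799 x 6) is consistent with keeping the originals and adding five augmented copies of each image, although that is an inference from the numbers rather than something the project code confirms. The sketch below shows how such a set could be assembled. It uses only a NumPy brightness change as a stand-in for the real augmentation (the affine transformations are omitted), and the 0.8 to 1.2 brightness range, `brighten` and `build_augmented_set` are assumptions for illustration.

```python
import numpy as np

def brighten(images, rng, low=0.8, high=1.2):
    """Scale each image by one random brightness factor in [low, high]."""
    factors = rng.uniform(low, high, size=(len(images), 1, 1, 1))
    out = images.astype(np.float32) * factors
    return np.clip(out, 0, 255).astype(np.uint8)

def build_augmented_set(images, labels, n_copies=5, seed=0):
    """Stack the originals with n_copies augmented versions of each image."""
    rng = np.random.default_rng(seed)
    all_images = [images] + [brighten(images, rng) for _ in range(n_copies)]
    all_labels = [labels] * (n_copies + 1)
    return np.concatenate(all_images), np.concatenate(all_labels)

# Four dummy 32x32 RGB images standing in for the training set.
X_demo = np.full((4, 32, 32, 3), 100, dtype=np.uint8)
y_demo = np.array([0, 1, 2, 3])
X_aug, y_aug = build_augmented_set(X_demo, y_demo)
print(X_aug.shape, y_aug.shape)  # (24, 32, 32, 3) (24,)
```

Labels are repeated in the same order as the image blocks, so each augmented image keeps the label of its source image.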
The preprocessing includes:
- Normalizing the images. Normalized inputs keep the input mean and variance small, which makes it easier for the optimizer to fit the model parameters.
- Converting the images to grayscale. This gives the model less input to train on. Here are examples of normalized and grayscale images:
The final input fed to the ConvNet has shape (208794, 32, 32, 1). Note that the data are shuffled before being fed into the ConvNet.
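
The two preprocessing steps and the shuffle can be sketched in NumPy as follows. The exact grayscale weights and normalization formula used in the project are not stated, so the luminance weights and the `(x - 128) / 128` scaling here are assumptions, as are the function names.

```python
import numpy as np

def preprocess(images):
    """Convert RGB uint8 images to normalized single-channel float arrays."""
    rgb = images.astype(np.float32)
    # Assumed luminance weights; a plain channel mean would also work.
    gray = rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Assumed normalization: map [0, 255] to roughly [-1, 1].
    gray = (gray - 128.0) / 128.0
    return gray[..., np.newaxis]

def shuffle_together(images, labels, seed=0):
    """Apply the same random permutation to images and labels."""
    perm = np.random.default_rng(seed).permutation(len(images))
    return images[perm], labels[perm]

# Random dummy images standing in for the augmented training set.
X_demo = np.random.default_rng(0).integers(0, 256, size=(10, 32, 32, 3), dtype=np.uint8)
y_demo = np.arange(10)
X_gray = preprocess(X_demo)
X_shuf, y_shuf = shuffle_together(X_gray, y_demo)
print(X_gray.shape)  # (10, 32, 32, 1)
```

Shuffling images and labels with one shared permutation keeps every image paired with its label.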
Here is the network of my training layers.
I use the original LeNet architecture plus an inception layer in the middle, which provides better performance and faster convergence.
For the inception layer, here is the architecture:
Layer | Description |
---|---|
Input | 32x32x1 normalized grayscale image |
Convolution#1 | 1x1 stride, valid padding, outputs 28x28x6 |
RELU#1 | |
Max pooling#1 | 2x2 stride, valid padding, outputs 14x14x6 |
Convolution#2 | 1x1 stride, valid padding, outputs 10x10x16 |
RELU#2 | |
Max pooling#2 | 2x2 stride, valid padding, outputs 5x5x16 |
inception | maxpool(same pad), outputs 5x5x31 |
Flatten#2 | outputs: 400 |
Full Connection#0 | outputs:120 |
Full Connection#1 | output: 84 |
Full Connection#2 | output: 43 |
Output log | 43 logits |
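
The spatial sizes in the table follow from the usual formula for valid padding, `out = (in - kernel) // stride + 1`. The short check below reproduces them up to the flatten step. The 5x5 convolution kernels and 2x2 pooling windows are assumptions: they are the standard LeNet values and are consistent with the sizes in the table, but the table itself only lists strides.

```python
def valid_output_size(size, kernel, stride):
    """Output width/height of a conv or pooling layer with valid padding."""
    return (size - kernel) // stride + 1

size = 32
conv1 = valid_output_size(size, kernel=5, stride=1)   # Convolution#1
pool1 = valid_output_size(conv1, kernel=2, stride=2)  # Max pooling#1
conv2 = valid_output_size(pool1, kernel=5, stride=1)  # Convolution#2
pool2 = valid_output_size(conv2, kernel=2, stride=2)  # Max pooling#2
flat = pool2 * pool2 * 16                             # 16 feature maps after Convolution#2

print(conv1, pool1, conv2, pool2, flat)  # 28 14 10 5 400
```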
I start with a baseline training run that reaches a final training accuracy of 0.99 and a final validation accuracy of 0.92, using the hyperparameters from the LeNet lab homework. This run is the baseline for my model development: the performance of every later experiment is compared against it.
For this baseline training I use:
- batch size: 64
- learning rate: 0.001
- optimizer: Adam
- epochs: 20
- layers: original LeNet
- regularization: none
Problem 1: the small batch size seems to hurt training stability.
Problem 2: the learning rate and other hyperparameters can be tuned further.
Problem 3: the model is clearly overfitting (training accuracy 0.99 vs. validation accuracy 0.92).
So I run a series of experiments on regularization techniques and hyperparameter tuning.
For this experiment, I add dropout to the two FC layers of LeNet and try keep-probability values from 0.3 to 1.0 to see the effect on accuracy and loss.
I save the parameters I use to a CSV file and the plots to a PDF file at 'hyper_para_testing\dropout\dropout_test.pdf'.
Across all test results, a keep probability of 0.5 gives the best performance, as shown below.
With a dropout keep probability of 0.5, the validation accuracy increases from 0.94 to 0.96.
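
For readers unfamiliar with the keep-probability parameter, here is a minimal NumPy sketch of the usual "inverted dropout" formulation. It is an illustration of the technique, not the project's layer code; the function name and arguments are hypothetical.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    """Zero each unit with probability 1 - keep_prob and rescale the rest."""
    if not training or keep_prob >= 1.0:
        return activations
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((100, 100))
out = dropout(x, keep_prob=0.5, rng=rng)
print(out.mean())  # close to 1.0 on average
```

With a keep probability of 0.5, about half of the units are dropped at each training step and the survivors are doubled. At evaluation time the layer is switched off (`training=False`).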
For this experiment, I combine L2 regularization with dropout to see what happens to the performance.
I multiply the L2 term by 0.001 and add it to the loss function.
I save the parameters I use to a CSV file and the plots to a PDF file at 'hyper_para_testing\dropout_and_L2_norm\dropout_and_L2norm.pdf'.
The test results show that adding the L2 term does not improve performance when dropout is already in use, as shown below.
Therefore I choose not to use L2 regularization.
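
As a rough illustration of how the L2 term enters the loss, the sketch below adds 0.001 times the sum of squared weights to a data loss. Whether the project used the plain sum of squares or a variant with a factor of 1/2 is not stated, so the plain sum here is an assumption, and the names are hypothetical.

```python
import numpy as np

def l2_penalty(weights, scale=0.001):
    """scale times the sum of squared entries over all weight arrays."""
    return scale * sum(float(np.sum(w ** 2)) for w in weights)

def total_loss(data_loss, weights, scale=0.001):
    """Cross-entropy (or any data loss) plus the L2 penalty."""
    return data_loss + l2_penalty(weights, scale)

# Dummy weights: sum of squares is 4 * 1 + 3 * 4 = 16.
weights_demo = [np.ones((2, 2)), np.full(3, 2.0)]
loss = total_loss(1.0, weights_demo)
print(loss)  # approximately 1.016
```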
Batch normalization is reported to improve model performance and to have a slight regularization effect.
Here I add batch norm layers and try them with and without dropout to see what happens.
The full results are in 'hyper_para_testing\batchnorm\batchnorm.pdf'.
Below is the result of batch norm combined with dropout. It seems to hurt performance. I would not normally expect batch norm to hurt; here it may be because the input is already normalized, so batch norm contributes little to my model.
I did not use batch norm in the later experiments.
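
For reference, this is what a batch norm layer computes on a mini-batch during training. It is a simplified sketch for a 2-D batch of features; the running averages used at inference time are left out, and `gamma`, `beta` and `eps` are the usual learnable scale, learnable shift and numerical-stability constant rather than values from the project.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(256, 4))
normed = batch_norm_train(batch, gamma=np.ones(4), beta=np.zeros(4))
print(normed.mean(axis=0), normed.std(axis=0))
```

With `gamma = 1` and `beta = 0`, every feature of the output has approximately zero mean and unit variance over the batch.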
According to some sources, SGD tends to perform better on small datasets.
Rate decay also usually helps the model converge.
Here I try a momentum optimizer with rate decay instead of Adam to see what happens.
The results are in 'hyper_para_testing\rate_decay\rate_decay.pdf'. I use:
- initial learning rate: 0.025
- rate decay: 0.85 (for each epoch, rate = rate * ratedecay)
- momentum factor: 0.9
Below is the plot with the result.
For my model, SGD with momentum and rate decay appears much more stable than Adam, so I decide to use the momentum optimizer for my later models.
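
The per-epoch decay and the momentum update can be sketched on a toy problem as follows. This uses one common formulation of momentum (accumulate the gradient into a velocity, then step along the velocity); the project's optimizer may differ in detail. The values 0.025, 0.85 and 0.9 come from the experiment above, while the function names and the quadratic toy objective are illustrative.

```python
def decayed_rate(initial_rate, decay, epoch):
    """Rate after `epoch` multiplications by `decay`."""
    return initial_rate * decay ** epoch

def momentum_step(param, grad, velocity, rate, momentum=0.9):
    """One SGD-with-momentum update; returns the new parameter and velocity."""
    velocity = momentum * velocity + grad
    param = param - rate * velocity
    return param, velocity

# Toy objective f(x) = x^2 with gradient 2x, one update per "epoch".
x, v = 1.0, 0.0
for epoch in range(20):
    rate = decayed_rate(0.025, 0.85, epoch)
    x, v = momentum_step(x, 2.0 * x, v, rate)
print(x)
```

The decay makes late updates much smaller than early ones, which is what stabilizes the end of training.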
These experiments are driven more by personal interest. They do not contribute to my final model, so I don't show them here.
If you are interested in the results, please look in the hyper_para_testing folder.
Briefly, the constrained softmax loss layer comes from a paper that reports a regularization effect.
Monte Carlo simulation here means randomly sampling hyperparameters to search for a good combination.
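
The random hyperparameter sampling can be sketched as a simple random search. Everything in this block is hypothetical: the sampled ranges, the parameter names and the `evaluate` callback (which in practice would train a model and return its validation accuracy) are placeholders, not the project's actual setup.

```python
import random

def sample_hyperparameters(rng):
    """Draw one random hyperparameter combination (assumed ranges)."""
    return {
        "rate": 10 ** rng.uniform(-4, -1),       # log-uniform learning rate
        "keep_prob": rng.uniform(0.3, 1.0),      # dropout keep probability
        "batch_size": rng.choice([64, 128, 256, 512]),
    }

def random_search(evaluate, n_trials=20, seed=0):
    """Return the best-scoring parameters among n_trials random draws."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = sample_hyperparameters(rng)
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy score that peaks at keep_prob = 0.5, standing in for validation accuracy.
best_params, best_score = random_search(lambda p: -(p["keep_prob"] - 0.5) ** 2)
print(best_params, best_score)
```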
The inception layer is a layer developed by Google that helps model performance and convergence.
Here I use some hyperparameters found by the Monte Carlo simulation and try an inception layer on LeNet.
For my current dataset and model, it makes no visible difference. I think that for a more complex model or dataset it can improve the model's ability to interpret the data and help increase performance. I did in fact find it useful in my later experiment with the augmented dataset.
From the experiments above I reach 96% accuracy on the validation set, but the model is still overfitting.
So I decide to try data augmentation.
As stated in section 2, I tried two different augmented datasets.
The first augmented training set is heavily augmented and very noisy.
With it I could not get either training or validation accuracy above 94%.
As a result, I built the second augmented training set. My final model trained on it gets very good results, as shown in section 3.3.
For the final model I achieved 98.4% accuracy on the training set and 98.3% on the validation set.
Note that the final model uses the inception layer, because it significantly increases my training speed and does not seem to hurt performance.
Below are the details and hyperparameters used for the final model:
- dropout keep probability: 0.5
The dropout keep probability is tuned to 0.5 to get reasonably good regularization.
- initial rate: 0.035
- rate decay: 0.88 (rate = 0.88 * rate for each epoch)
I use the rate, the rate decay and some additional logic to set the rate.
Example of the rate logic:
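```python
def adjust_rate(train_accuracy, rate):
    """Raise the decayed learning rate when it falls below a floor tied to training accuracy."""
    n_rate = rate  # assumption: keep the decayed rate when no rule applies
    if train_accuracy < 0.97 and rate < 0.01:
        n_rate = 0.01
    elif train_accuracy > 0.97 and train_accuracy < 0.98 and rate < 0.004:
        n_rate = 0.005
    elif train_accuracy > 0.98 and train_accuracy < 0.983 and rate < 0.0015:
        n_rate = 0.004
        #elif train_accuracy>0.98 and train_accuracy<0.983 and rate<0.0015:
        n_rate = 0.002  # left as in the original: this later assignment overrides 0.004
    elif train_accuracy > 0.983 and train_accuracy < 0.985 and rate < 0.008:
        n_rate = 0.0015
    elif train_accuracy > 0.985 and train_accuracy < 0.987 and rate < 0.0004:
        n_rate = 0.0008
    return n_rate
```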
These rate conditions were fine-tuned on the augmented training set to help the model converge to a good optimum. The augmented training set is harder to train on than the original one. When I tried a fixed rate or plain rate decay, the model easily got stuck at a local optimum. With these conditions I get very good accuracy.
- batch size: 512 for the first part of training and 4096 for the second part.
I choose these batch sizes because my 1080 Ti has enough memory to handle them, and the model converges fast enough and to a reasonable point.
- epochs: 210
The training accuracy is hard to improve beyond this number of epochs. I think it is a reasonable early stopping point to prevent overfitting.
- L2 regularization: not used
Dropout already gives good regularization, so I do not add an L2 term.
- momentum factor: 0.9
0.9 works well during training.
In the first part of training, I use the hyperparameters shown above and let the model stop at training accuracy 98% and validation accuracy > 98%. This takes 112 epochs.
Since I suspected that the model performance could be improved further, I decided to train for more epochs to get better final performance.
I load the model just trained and train for additional epochs, letting the model stop at training accuracy > 98.4% and validation accuracy > 98.3%. This takes another 98 epochs.
In this period, I use tighter learning-rate settings and a batch size of 4096 for better convergence.
The final model performance is shown in the following figure.
The final model is saved to '.\lenet_final'
Here are my answers to some of the questions:
What architecture was chosen?
LeNet-5 with an inception layer.
Why did you believe it would be relevant to the traffic sign application?
Because it performs very well on digit classification, I believe it should perform similarly on traffic signs, since both are multiclass image classification problems.
How does the final model's accuracy on the training, validation and test set provide evidence that the model is working well?
The training accuracy of 98.4% and validation accuracy of 98.3% indicate that the model fits the training data well and is neither overfitting nor underfitting. The 95.7% accuracy on the test set shows that it works well on out-of-sample prediction. However, further improvements are possible.
The model accuracy on the test set is 95.7%. Further improvement can be made by:
- adding more diverse validation and training data
- improving the model architecture
- tuning the hyperparameters better
- improving the training process
Below are the per-label precision and recall for the model, which show which labels need to improve in a future model.
label1 recall:0.8833 precision:0.8983
label2 recall:0.9875 precision:0.9556
label3 recall:0.9733 precision:0.9505
label4 recall:0.9044 precision:0.9532
label5 recall:0.9652 precision:0.9830
label6 recall:0.9413 precision:0.9095
label7 recall:0.9400 precision:0.9463
label8 recall:0.9222 precision:0.9857
label9 recall:0.9911 precision:0.9272
label10 recall:1.0000 precision:0.9677
label11 recall:0.9894 precision:0.9969
label12 recall:0.9357 precision:0.9269
label13 recall:0.9870 precision:0.9913
label14 recall:0.9944 precision:0.9958
label15 recall:1.0000 precision:0.8882
label16 recall:0.9905 precision:0.9952
label17 recall:1.0000 precision:1.0000
label18 recall:0.9361 precision:0.9825
label19 recall:0.8641 precision:0.9656
label20 recall:0.9667 precision:0.9831
label21 recall:1.0000 precision:0.9574
label22 recall:0.7222 precision:0.8904
label23 recall:0.9000 precision:1.0000
label24 recall:0.9933 precision:0.8514
label25 recall:0.8778 precision:0.8681
label26 recall:0.9875 precision:0.9151
label27 recall:0.9778 precision:0.9462
label28 recall:0.5667 precision:0.7556
label29 recall:0.9800 precision:0.9735
label30 recall:0.9111 precision:0.8723
label31 recall:0.7400 precision:0.7551
label32 recall:0.9926 precision:0.9926
label33 recall:1.0000 precision:0.9836
label34 recall:0.9810 precision:0.9810
label35 recall:1.0000 precision:0.9836
label36 recall:0.9590 precision:0.9973
label37 recall:0.9750 precision:0.9590
label38 recall:0.9500 precision:0.9500
label39 recall:0.9536 precision:0.9821
label40 recall:0.9556 precision:0.9053
label41 recall:0.8889 precision:0.8989
label42 recall:0.8000 precision:0.9796
label43 recall:0.8889 precision:0.9877
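
Per-label figures like the ones above can be computed from the true and predicted labels as sketched below. For each label, precision is the share of predictions of that label that are correct, and recall is the share of true examples of that label that are found. The function name and the small demo arrays are made up for illustration and are not the project's test data.

```python
import numpy as np

def precision_recall_per_class(y_true, y_pred, n_classes):
    """Return arrays of per-class precision and recall."""
    precision = np.zeros(n_classes)
    recall = np.zeros(n_classes)
    for c in range(n_classes):
        true_pos = np.sum((y_pred == c) & (y_true == c))
        predicted = np.sum(y_pred == c)   # everything labeled c by the model
        actual = np.sum(y_true == c)      # everything that really is c
        precision[c] = true_pos / predicted if predicted else 0.0
        recall[c] = true_pos / actual if actual else 0.0
    return precision, recall

y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0])
precision, recall = precision_recall_per_class(y_true_demo, y_pred_demo, 3)
print(precision)  # 0.5, 0.667, 1.0 (rounded)
print(recall)     # 0.5, 1.0, 0.5
```

A label with low recall (such as the 0.5667 above) is often missed by the model, while low precision means other signs are often mistaken for it.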
I downloaded 8 pictures of German traffic signs from the web and used my model to predict the traffic sign type.
Below are the images after resizing.
Below are traffic sign type predictions for each image:
label value for test1.jpg: 1, predict value for test1.jpg: 1
label value for test2.jpg: 27, predict value for test2.jpg: 27
label value for test3.jpg: 40, predict value for test3.jpg: 40
label value for test4.jpg: 11, predict value for test4.jpg: 11
label value for test5.jpg: 25, predict value for test5.jpg: 25
label value for test6.jpg: 38, predict value for test6.jpg: 38
label value for test7.jpg: 17, predict value for test7.jpg: 17
label value for test8.jpg: 18, predict value for test8.jpg: 18
As you can see, my model accuracy on the 8 web images is 100%. Cool! Awesome!
The images I chose are actually not hard for my model (or for human beings) to classify. I expect it would be harder with images that are more difficult to classify.
Here I output top 5 softmax probabilities for each image found on the web:
top 5 Softmax Probabilities for test1.jpg:
probabilities:100.000%| 0.000%| 0.000%| 0.000%| 0.000%
predict label: 1 | 0 | 2 | 4 | 3
Top probability is label 1, which is predicted as:Speed limit (30km/h)
--------------------------------------------------------------------------------
top 5 Softmax Probabilities for test2.jpg:
probabilities:100.000%| 0.000%| 0.000%| 0.000%| 0.000%
predict label: 27 | 11 | 30 | 25 | 21
Top probability is label 27, which is predicted as:Pedestrians
--------------------------------------------------------------------------------
top 5 Softmax Probabilities for test3.jpg:
probabilities:89.835%| 7.610%| 2.554%| 0.000%| 0.000%
predict label: 40 | 7 | 12 | 1 | 39
Top probability is label 40, which is predicted as:Roundabout mandatory
--------------------------------------------------------------------------------
top 5 Softmax Probabilities for test4.jpg:
probabilities:100.000%| 0.000%| 0.000%| 0.000%| 0.000%
predict label: 11 | 30 | 27 | 37 | 21
Top probability is label 11, which is predicted as:Right-of-way at the next intersection
--------------------------------------------------------------------------------
top 5 Softmax Probabilities for test5.jpg:
probabilities:100.000%| 0.000%| 0.000%| 0.000%| 0.000%
predict label: 25 | 0 | 1 | 2 | 3
Top probability is label 25, which is predicted as:Road work
--------------------------------------------------------------------------------
top 5 Softmax Probabilities for test6.jpg:
probabilities:100.000%| 0.000%| 0.000%| 0.000%| 0.000%
predict label: 38 | 0 | 1 | 2 | 3
Top probability is label 38, which is predicted as:Keep right
--------------------------------------------------------------------------------
top 5 Softmax Probabilities for test7.jpg:
probabilities:100.000%| 0.000%| 0.000%| 0.000%| 0.000%
predict label: 17 | 38 | 12 | 39 | 34
Top probability is label 17, which is predicted as:No entry
--------------------------------------------------------------------------------
top 5 Softmax Probabilities for test8.jpg:
probabilities:97.964%| 2.034%| 0.001%| 0.001%| 0.000%
predict label: 18 | 26 | 27 | 24 | 25
Top probability is label 18, which is predicted as:General caution
--------------------------------------------------------------------------------
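
A listing like the one above can be produced by applying softmax to the logits and taking the five largest probabilities. The sketch below does this in NumPy for a single image; the logits are invented for the demo and the function names are illustrative, not the project's code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def top_k(probabilities, k=5):
    """Return the k most likely labels and their probabilities, best first."""
    order = np.argsort(probabilities)[::-1][:k]
    return order, probabilities[order]

# Made-up logits for one image: label 18 is strongest, label 26 second.
logits_demo = np.zeros(43)
logits_demo[18] = 10.0
logits_demo[26] = 6.0
probs = softmax(logits_demo)
labels, top_probs = top_k(probs)
print(labels[:2], top_probs[:2])
```

Because softmax is exponential in the logits, a moderate gap between the top two logits already yields a top probability that rounds to 100.000% in the printout.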
The model is 100% certain about images 1, 2, 4, 5, 6 and 7, but not about images 3 and 8.
Let's visualize what the model confuses test3.jpg with:
For test8.jpg, here is the visualization of the confused label:
The confused label's image looks very similar to the true label's image.
For the test set the model has 95.7% accuracy, while for the new images it has 100% accuracy.
This comparison is not really fair because the new image set contains too few images.
With more new images, the comparison would be more meaningful.
Take the following input image as an example:
Below are some visualizations of the layer outputs:
conv1:
conv1 activation:
conv1 maxpool:
conv2:
conv2 activation:
inception: