ML-Group-Project-8: Pothole Detection on the Jetson Nano

Introduction

With the rise of autonomous vehicles, there is an ever-growing need for automated road safety and road defect detection. Our objective in this project is to develop machine learning models that detect potholes in roads, a small step towards making autonomous driving safer. The data we currently plan to use comes from the Brazilian National Department of Transport Infrastructure and consists of 2235 images of highways in the states of Espírito Santo, Rio Grande do Sul and the Federal District, taken from 2014 to 2017. The resolution of the images is at least 1280x729 with a 16:9 aspect ratio. We hope to train deep convolutional neural networks, as well as an SVM image classification/object detection model, to consistently identify potholes in new road images, with the end goal of real-time video inference on a Jetson Nano. We will also consider data preprocessing methods such as image masking/transforming, HOG (Histogram of Oriented Gradients) extraction, negative image extraction, grayscale extraction, and scaling in our pipeline to further optimize our models. Our hypothesis is that a sufficiently large deep convolutional neural network is capable of accurately classifying road defects, and we hope to optimize its performance with what we have learned in class and previous work in the field.

Methods

Data Creation

The first step is to create the dataset with the images, their labels, and the parameters for the pothole bounding boxes. We use Keras to load the images and convert them to NumPy arrays, and cv2 to detect the potholes and label them. We store the dataset as a CSV file. This is what a sample/mask looks like with its corresponding bounding box:

The table organizes the attributes of each image into columns. The img column holds the grayscale representation of the images pulled from git, which makes the calculations more efficient: a single channel instead of three reduces the cost of the convolutions. The pothole attribute, which is the target of the project, is encoded as 1 if the image contains a pothole and 0 if it does not.
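To make the bounding-box step of the dataset creation concrete, here is a minimal sketch that derives one box from a binary mask. It is an illustration only: the project uses cv2 for this step, while this version uses plain NumPy, and the function name `mask_to_bbox` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions, not taken from the project code.

```python
import numpy as np


def mask_to_bbox(mask: np.ndarray):
    """Return (x_min, y_min, x_max, y_max) enclosing all non-zero pixels, or None.

    Assumes `mask` is a 2-D array in which pothole pixels are non-zero.
    """
    rows = np.any(mask > 0, axis=1)  # rows containing at least one pothole pixel
    cols = np.any(mask > 0, axis=0)  # columns containing at least one pothole pixel
    if not rows.any():
        return None  # no pothole in this mask
    y_min, y_max = np.where(rows)[0][[0, -1]]
    x_min, x_max = np.where(cols)[0][[0, -1]]
    return int(x_min), int(y_min), int(x_max), int(y_max)


# Toy 6x8 mask with one white rectangle (rows 2-3, columns 3-6).
toy_mask = np.zeros((6, 8), dtype=np.uint8)
toy_mask[2:4, 3:7] = 255
toy_bbox = mask_to_bbox(toy_mask)
```

Note that this sketch returns a single box around every white pixel, so a mask with several potholes would yield one merged box. A contour-based approach would be needed to separate them.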

After creating the csv dataset, it looks like this:

Data Exploration

We are working with a dataset that contains 2235 samples (images). The target classes are 0 and 1, which correspond to the given road containing a pothole (1) or not (0). As the target class distribution shows, the data is imbalanced: we have 564 samples with potholes and 1671 samples without. A careless split can therefore leave the training data skewed even further towards one class, so it's important that we shuffle and split our data carefully.
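One way to split carefully is a stratified split, which keeps the pothole/no-pothole ratio roughly equal in the training and test sets. The sketch below is a NumPy-only illustration. The function name `stratified_split`, the 20% test fraction, and the seed are assumptions, not choices documented by the project.

```python
import numpy as np


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class's share preserved in both splits."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)  # indices of this class
        rng.shuffle(idx)                     # shuffle within the class
        n_test = int(round(len(idx) * test_fraction))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    train_idx = np.concatenate(train_parts)
    test_idx = np.concatenate(test_parts)
    rng.shuffle(train_idx)  # mix the classes again
    rng.shuffle(test_idx)
    return train_idx, test_idx


# Class counts from the text: 564 with potholes (1), 1671 without (0).
labels = np.array([1] * 564 + [0] * 1671)
train_idx, test_idx = stratified_split(labels)
```

scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument, which may be more convenient in a notebook.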

What each of the target classes looks like:

The distributions of bounding box widths and heights are strongly right-skewed, as one would expect.

Data Preprocessing

Our dataset consists of 2235 pairs of images (an original image and its pothole mask). Each image is either 630 by 1024 or 640 by 1024. To standardize, we scale each image down to 600 by 600. This also makes training easier by decreasing the dimensions we input into our model [1].
Once the image is loaded into our Python environment as a PIL object, we convert it to grayscale. This is only needed for the original images, as they are in color and the pothole masks are already black and white.
We then normalize the image by turning it into a NumPy array and dividing by 255.
Our result is 2235 image pairs, all of which are (600, 600) NumPy arrays with float values ranging from 0 to 1.
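The preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the project's notebook code: the function name `preprocess`, the use of PIL's default resampling filter, and the synthetic stand-in image are assumptions.

```python
import numpy as np
from PIL import Image

TARGET_SIZE = (600, 600)  # (width, height), as described in the text


def preprocess(img: Image.Image) -> np.ndarray:
    """Resize to 600x600, convert to grayscale, and scale pixel values to [0, 1]."""
    small = img.resize(TARGET_SIZE)  # default resampling filter (an assumption)
    gray = small.convert("L")        # single 8-bit channel; a no-op for masks already in "L"
    return np.asarray(gray, dtype=np.float32) / 255.0


# Synthetic stand-in for a color road image; PIL sizes are (width, height),
# and we assume here that 1024 is the width.
fake_image = Image.new("RGB", (1024, 640), color=(128, 64, 32))
arr = preprocess(fake_image)
```

Resizing a non-square image to a square changes its aspect ratio, so bounding box coordinates need to be rescaled by the same factors.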

[1] The main reason is that as long as the rescaling doesn't significantly distort the relevant features of an image, shrinking it down allows us to build a deeper model and reduce the computational load on our models. Reducing the costs allows us to test on even more data. We will test it out but most likely we are going to end up shrinking the images even further (to 250 by 250) later on.

Model 1: Simple Model

Our first model is a Convolutional Neural Network with the layers:

This simple model has 4 convolutional layers and 1 Dense layer with 62 nodes. We trained for 15 epochs with a batch size of 2, the Adam optimizer with a learning rate of 0.0001, and mean squared error (MSE) as the loss function. This model has 5 outputs: the four bounding box coordinates and the class.
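To illustrate what an MSE loss over the 5 outputs measures, here is a small NumPy sketch. The output ordering `[x_min, y_min, x_max, y_max, has_pothole]` and the normalization of coordinates to [0, 1] are assumptions made for the example. The source does not specify either.

```python
import numpy as np


def mse(y_true, y_pred) -> float:
    """Mean squared error across all output components."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical layout: [x_min, y_min, x_max, y_max, has_pothole]
target = [0.2, 0.4, 0.6, 0.8, 1.0]
prediction = [0.3, 0.4, 0.5, 0.8, 0.0]  # box slightly off, class completely wrong
loss = mse(target, prediction)
```

In this example the wrong class accounts for 1.0 of the 1.02 total squared error, which shows how a single MSE loss mixes the regression and classification errors into one number.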

Model 2: YOLO Model

Our second model is similar to the original YOLOv1 object detection CNN, with the layers:

This model has 20 convolutional layers and 5 Dense layers, with Batch Normalization and Leaky ReLU as in the original YOLOv1 paper. We trained for 15 epochs with a batch size of 2 and the Adam optimizer with a learning rate of 0.001. We decided to make this a regression-only model, meaning it outputs only the bounding box predictions and not the class.

Model 3: VGG16 Model

Our third model extends an existing network with preset initial weights. We replaced the VGG16 network head with our own trainable Dense layers to output the predicted bounding box coordinates:

This model has 13 convolutional layers and 4 Dense layers, with Max Pooling in between layers. We trained for 10 epochs with a batch size of 2 and the Adam optimizer with a learning rate of 0.0001. This model is also regression-only, meaning it has 4 outputs corresponding to the bounding box coordinate predictions.

Model 4: Branched VGG16

Our fourth model is a work in progress. It is similar to Model 3, as it extends an existing network with preset initial weights. We replaced the VGG16 network head with two branches of our own trainable Dense layers, one to output the predicted bounding box coordinates and one to output the predicted class labels.

This model has 13 convolutional layers and 4 Dense layers for each branch, with Max Pooling in between layers. We trained for 15 epochs with a batch size of 2 and the Adam optimizer with a learning rate of 0.0001. This model has 5 outputs, corresponding to the 4 bounding box coordinate predictions and the binary pothole classification.

Model 5: Binary Classification using CNNs (with dropout)

Given our limited computational resources and a fairly small dataset, we thought it best to build a model with more modest aims. Instead of training a model that finds bounding boxes, why not just build one that tells us whether a pothole is present in the image or not? This is binary classification, a task better suited to the methods we learned in class (basic CNNs).

First things first: we need to label our dataset. Technically, it is already labeled with masks. Each mask is a black-and-white image with white pixels in the regions where potholes are. So for preprocessing, we look at the NumPy array representation of a mask, and if the array contains any values of 255 we classify the corresponding image as one with a pothole. We do this in the notebook dataset_org.ipynb and then rearrange the dataset into two folders: one for images with potholes and one for those without. We then download this new version of the dataset from Colab and upload it to Google Drive as a publicly shared zip file for easy access.

We put the images into folders by class because that lets us turn the dataset into a tf.data.Dataset object, which is an efficient and convenient way to load data when working in Keras. It loads images into memory only when a batch is needed, reading them from disk and applying the preprocessing methods at that point. Now let’s get into the actual model:
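For reference, the mask-based labeling rule described in this section can be sketched in a few lines. The function name `has_pothole` and the toy masks are illustrative assumptions. The actual logic lives in dataset_org.ipynb and may differ in detail.

```python
import numpy as np


def has_pothole(mask: np.ndarray) -> int:
    """Return 1 if any pixel of the mask is white (255), else 0."""
    return int(np.any(mask == 255))


# Two toy masks: one empty, one with a white rectangle marking a pothole.
empty_mask = np.zeros((600, 600), dtype=np.uint8)
pothole_mask = empty_mask.copy()
pothole_mask[100:120, 200:260] = 255
labels = [has_pothole(empty_mask), has_pothole(pothole_mask)]
```

Note that this rule only fires on exact 255 values. If masks were resized with interpolation or normalized beforehand, edge pixels may no longer be exactly 255, so the check should run on the original masks.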

Model 6: SVM (Support Vector Machine) Model for image classification and pothole detection

We adapt an SVM object detection model from Kaggle, inspired by Mehmet Tekman's work, which originally classified cars. We modify the code and use it to classify images and, going further, to detect potholes via the SVM bounds. After data preprocessing we augment the images with three representations: grayscale images, HOG (Histogram of Oriented Gradients) images, which highlight contours and distinct shapes such as potholes, and negative images, which invert the colors. The main decision also uses the heatmap, and the heatmap/image predictions are displayed in the results. We also resize the images to reduce the amount of detail the model has to process.

Negative Image Example:

The inverted colors can make it easier for the SVM to detect the pothole.
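Producing a negative image is a one-line operation on an 8-bit array. The sketch below assumes images stored as `uint8` NumPy arrays. The function name `negative` is hypothetical.

```python
import numpy as np


def negative(img: np.ndarray) -> np.ndarray:
    """Invert an 8-bit image: dark pixels become bright and vice versa."""
    return 255 - img


# Tiny 2x2 grayscale patch to show the effect.
patch = np.array([[0, 50], [200, 255]], dtype=np.uint8)
inverted = negative(patch)
```

Applying the function twice returns the original image, so no information is lost by this transformation.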

Hog Image Example:

As we can see in the image, HOG (Histogram of Oriented Gradients) counts the occurrences of each gradient orientation in localized parts of the image. We use this to extract edges and features from the image, ideally those of the pothole specifically. We use HOG features rather than the raw image because raw images vary in occlusion, color, lighting, and so on. HOG reduces this noise and serves as a representation of the image that is less sensitive to these variations.
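The core counting step of HOG can be illustrated for a single cell. This is a simplified sketch, not the full descriptor: a complete HOG implementation (for example `skimage.feature.hog`) also tiles the image into many cells and normalizes over blocks. The function name, the 9 bins, and the 8x8 toy cell are assumptions chosen for the example.

```python
import numpy as np


def cell_orientation_histogram(cell: np.ndarray, n_bins: int = 9) -> np.ndarray:
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    cell = cell.astype(float)
    gy, gx = np.gradient(cell)                      # finite differences along rows, columns
    magnitude = np.hypot(gx, gy)                    # gradient strength per pixel
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in degrees
    hist, _ = np.histogram(
        angle, bins=n_bins, range=(0.0, 180.0), weights=magnitude
    )
    return hist


# Toy 8x8 cell with a vertical edge: intensity changes only along x.
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0
hist = cell_orientation_histogram(cell)
```

For this toy cell all gradient energy falls into the first bin (orientation near 0 degrees), which is how a strong vertical edge shows up in the descriptor.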

Results

We are using IOU as an accuracy metric for the bounding boxes. Intersection over Union (IOU) is defined as the area of overlap divided by the area of union of the predicted and true bounding boxes. It ranges from 0 (no overlap) to 1 (identical boxes), and an IOU above 0.5 is commonly treated as a good match.
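The IOU definition above translates directly into code. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` corner coordinates. The project may store boxes differently (for example as x, y, width, height), in which case they would need converting first.

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give zero overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 100x100 boxes offset by 50 pixels in each direction.
true_box = (100, 100, 200, 200)
pred_box = (150, 150, 250, 250)
score = iou(true_box, pred_box)
```

Here the overlap is 50 x 50 = 2500 and the union is 10000 + 10000 - 2500 = 17500, so the score is 1/7 (about 0.14), well below the 0.5 threshold even though the boxes visibly overlap.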