
Distracted Driver Detection

Jeremy Collins, Alan Hesu, Kin Man Lee, Shruthi Saravanan, Dhrumin Shah

Demo video: DDD_clip.mp4

Introduction

Distracted driving causes about 920,000 accidents in the US per year according to the National Highway Traffic Safety Administration (NHTSA) [1]. Experts in traffic safety and public health agree that even this figure underestimates the dangers of distracted driving.

Prior work in this field includes a data augmentation method for distracted driving detection that extracts the relevant driving operation area from each image as a preprocessing step using an R-CNN model [2]. The findings from this paper demonstrate the importance of operation area extraction in preprocessing, which efficiently reduces redundant information in images and improves classification accuracy. Another study presents a distracted driving identification algorithm based on deep CNNs [3]; this approach coupled PCA with a multi-layer CNN to further improve performance.

In this project, we will use the State Farm Distracted Driver Detection dataset [4] to classify normal driving, texting, phone conversation, radio operation, drinking, reaching back, doing hair and makeup, and talking to passengers.

Problem Definition

70% of fatal crashes are caused by unsafe driving behavior. Our project aims to preemptively identify such driving behavior in order to help prevent fatal accidents and adjust a driver’s habits.

Data Collection

We chose to use the State Farm Distracted Driver Detection dataset, a collection of 22,424 images of drivers operating a vehicle [4]. Because labels were only provided for the training data, we split the training portion of the dataset into a new training set (80%) and test set (20%). An option was also included to shuffle the dataset before splitting, which determines whether the test set contains individuals not seen during training. To access items in the dataset, we grouped each image with its ground truth label and person ID, both obtained from a CSV file.
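
As a minimal sketch, the split could be implemented along the following lines; the CSV name and column names (`driver_imgs_list.csv`, `subject`, `classname`, `img`) follow the Kaggle dataset layout and are assumptions here, as are the exact split parameters.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the per-image metadata: person ID, class label, and image file name.
labels = pd.read_csv("driver_imgs_list.csv")

shuffle_subjects = True  # toggle whether test subjects are seen during training

if shuffle_subjects:
    # Shuffle all images before splitting: the same drivers appear in both sets.
    train_df, test_df = train_test_split(labels, test_size=0.2, shuffle=True, random_state=0)
else:
    # Split by person ID: drivers in the test set are never seen during training.
    subjects = labels["subject"].unique()
    train_subj, test_subj = train_test_split(subjects, test_size=0.2, shuffle=False)
    train_df = labels[labels["subject"].isin(train_subj)]
    test_df = labels[labels["subject"].isin(test_subj)]
```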

In order to avoid limitations imposed by the size of our dataset and improve generalizability, we chose to augment the data in several ways. Transformations included resizing, color jitter, rotation, random cropping, and normalization. All images were converted to grayscale to reduce the memory usage and to remove color information from the data. An edge detection filter was also applied, as the edges of each subject, and thus their pose and position, would provide the most relevant information.
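
A sketch of such an augmentation pipeline using torchvision is shown below; the image sizes, jitter parameters, and Canny edge-detection thresholds are illustrative, not the values used in the project.

```python
import cv2
import numpy as np
from PIL import Image
from torchvision import transforms

def edge_filter(img):
    # Apply a simple Canny edge detector to a grayscale PIL image.
    edges = cv2.Canny(np.array(img), 100, 200)
    return Image.fromarray(edges)

# Illustrative pipeline: resize, color jitter, rotation, random crop,
# grayscale conversion, edge detection, and normalization.
augment = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.RandomRotation(10),
    transforms.RandomCrop(112),
    transforms.Grayscale(num_output_channels=1),
    transforms.Lambda(edge_filter),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])
```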

Examples of images in the dataset and the class they belong to are shown below:

subject 016, Safe driving | subject 052, Texting - right
subject 026, Talking on the phone - right | subject 015, Texting - left
subject 024, Talking on the phone - left | subject 012, Operating the radio
subject 042, Drinking | subject 035, Reaching behind
subject 052, Hair and makeup | subject 022, Talking to passenger

The distribution of image classes in the dataset is shown in the figure below.

Methods

Unsupervised methods

PCA

The processed dataset still has a very large number of features, so principal component analysis (PCA) was performed to further reduce the dimensionality of the dataset and extract the most relevant features that maximize the variance in the data. This effectively performs compression on each image and subsequently reduces the computation and memory requirements for training models on the data. To further reduce memory requirements, the IncrementalPCA method in sklearn was used to incrementally fit the PCA components on batches of data. A graph of the explained variance ratio was generated, and a suitable number of components was chosen to balance capturing as much variance as possible while making the memory requirements feasible.
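
The following is a minimal sketch of this procedure with scikit-learn's IncrementalPCA; the dummy data, image size, and component count are placeholders (note that each `partial_fit` batch must contain at least `n_components` samples).

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Dummy data standing in for flattened grayscale images (illustrative sizes).
rng = np.random.default_rng(0)
X_train = rng.random((1000, 64 * 64))
X_test = rng.random((200, 64 * 64))

ipca = IncrementalPCA(n_components=250)
for X_batch in np.array_split(X_train, 2):   # fit incrementally on batches
    ipca.partial_fit(X_batch)

# Cumulative explained variance ratio, used to choose the number of components.
cumulative_variance = np.cumsum(ipca.explained_variance_ratio_)

# Apply the same learned transformation to the training and test sets.
X_train_reduced = ipca.transform(X_train)
X_test_reduced = ipca.transform(X_test)
```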

The PCA transformation was fit using the training data, and the same transformation was then applied to the testing set. The scikit-learn PCA implementation centers the data, and because each feature is a pixel value, all features are already on the same scale, so the input to PCA is well-formed.

Autoencoder

As an alternative to PCA, a convolutional autoencoder was used to compress each image into a lower-dimensional feature representation. The autoencoder consisted of an encoder and decoder composed of mirrored sequences of convolutional layers with leaky ReLU activation functions. The final layer of the encoder and the first layer of the decoder were dense layers operating on the flattened feature maps, and their width determined the number of features in the encoded representation. With a feature vector length of 1000, the final autoencoder had approximately 24.6 million parameters.

To improve the convergence of the autoencoder, the input images were normalized with a per-channel mean and standard deviation of 0.5. The autoencoder was trained using the Adam optimizer with a learning rate of 5e-4 and a batch size of 64 for 10 epochs.
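
A minimal PyTorch sketch of such an autoencoder is shown below; the convolution widths, the assumed 128x128 input, and the reconstruction loss are illustrative and will not reproduce the reported 24.6-million-parameter model exactly.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim=1000):
        super().__init__()
        # Encoder: strided convolutions with leaky ReLU activations.
        self.encoder_conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(),
        )
        # Dense layers on the flattened feature maps set the encoding size.
        self.enc_fc = nn.Linear(64 * 16 * 16, latent_dim)
        self.dec_fc = nn.Linear(latent_dim, 64 * 16 * 16)
        # Decoder: mirrored transposed convolutions.
        self.decoder_conv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.LeakyReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.LeakyReLU(),
            nn.ConvTranspose2d(16, 3, 3, stride=2, padding=1, output_padding=1), nn.Tanh(),
        )

    def forward(self, x):
        z = self.enc_fc(self.encoder_conv(x).flatten(1))   # encode to latent vector
        h = self.dec_fc(z).view(-1, 64, 16, 16)            # expand back to feature maps
        return self.decoder_conv(h)

model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)   # as described above
criterion = nn.MSELoss()                                    # reconstruction loss (assumed)
```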

K-means

A K-means model was used as an unsupervised clustering method. For a given number of clusters k, a model was fit using the training data. Then, each cluster was associated with one of the ten ground truth labels by giving the cluster the same label as the highest occurring category in that cluster. Thus, an arbitrary number of k clusters can be used to classify images as one of the ten actual labels. Once fit, the model can be used to classify images in the testing set. To select an optimal number of clusters, an elbow plot was generated, plotting the model accuracy against the number of clusters and selecting a k value at the “elbow” of the curve.
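
A minimal sketch of this cluster-to-label mapping is shown below; the feature matrices and label arrays are placeholders, assumed to come from one of the dimensionality-reduction steps above.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_kmeans_classifier(X_train, y_train, k=300, seed=0):
    # Fit k-means, then map each cluster to its most common ground-truth label.
    km = KMeans(n_clusters=k, random_state=seed).fit(X_train)
    cluster_to_label = {}
    for c in range(k):
        members = y_train[km.labels_ == c]
        # Majority vote within the cluster; default to label 0 for empty clusters.
        cluster_to_label[c] = int(np.bincount(members).argmax()) if len(members) else 0
    return km, cluster_to_label

def predict_labels(km, cluster_to_label, X):
    # Assign each sample the label associated with its nearest cluster.
    return np.array([cluster_to_label[c] for c in km.predict(X)])
```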

Supervised Methods

Support Vector Machines

For experiments with support vector machines, SVMs with a linear kernel and with a 3rd-degree polynomial kernel were fit on both the shuffled and unshuffled datasets. Brief experimentation was also done using a radial basis function (RBF) kernel. The images were preprocessed to grayscale with edge detection applied. Once fitted, the models were used to classify the images in the testing set.
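
A minimal scikit-learn sketch of these setups is shown below; the choice of LinearSVC solving the primal problem follows the description later in this report, and X_train, y_train, and X_test are placeholders for the flattened, edge-filtered images and their labels.

```python
from sklearn.svm import LinearSVC, SVC

# Linear SVM solving the primal optimization problem.
linear_svm = LinearSVC(dual=False)
# SVM with a 3rd-degree polynomial kernel.
poly_svm = SVC(kernel="poly", degree=3)

linear_svm.fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)
```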

Convolutional Neural Networks

Several architectures were implemented and evaluated to classify the images: ResNet, VGG16, EfficientNet-B4, and a custom CNN architecture. To prevent overfitting with such large models, several forms of regularization were used. The images were augmented by randomly changing their brightness, contrast, saturation, and hue, and by random cropping and rotation. Additionally, weight decay was introduced to penalize the magnitude of the learned weights, and a dropout layer was added between the linear layers so that the model would not depend on specific features, making it more robust and better able to generalize.

For the custom CNN model, we used 6 convolutional layers in total, with the number of filters increasing from 8 to 512, along with 1 max-pooling layer in the middle of the network and 1 GlobalAveragePooling layer before the classifier. Before the final fully connected layer, dropout was applied in conjunction with the GlobalAveragePooling layer. The final fully connected layer had ten output nodes with a softmax activation function, and all other layers used the ReLU activation. The max-pooling layer reduces spatial dimensionality, and the GlobalAveragePooling layer further reduces dimensionality by converting each feature map to a single value, producing the row vector that a fully connected layer requires as input. Dropout layers were included to prevent overfitting and help the model generalize. Finally, the fully connected layer with softmax activation produces the prediction probabilities.
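
A minimal Keras sketch of this architecture is shown below; the per-layer filter counts, input size, and dropout rate are illustrative assumptions beyond what is described above.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),
    layers.Conv2D(8, 3, padding="same", activation="relu"),
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),                      # reduce spatial dimensionality mid-network
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Conv2D(512, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),            # convert feature maps to a row vector
    layers.Dropout(0.5),                        # regularize before the classifier
    layers.Dense(10, activation="softmax"),     # prediction probabilities over ten classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```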

For the VGG16 model, we loaded the pretrained weights but excluded the top fully connected layers. On top of a GlobalAveragePooling layer, we then added three dense layers with dropout layers in between, the last of which produces the class predictions.
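
A Keras sketch of this transfer-learning setup follows; the dense-layer widths, dropout rates, and input size are assumptions, since the report does not specify them.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pretrained VGG16 backbone without its top fully connected layers.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(10, activation="softmax")(x)   # ten driving classes

model = models.Model(inputs=base.input, outputs=outputs)
```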

Results and Discussion

PCA

For the un-augmented dataset, approximately 250 principal components appear to capture a sufficient amount of the variance in the data, at which point the explained variance curve begins to flatten and the marginal variance contributed by each additional component diminishes.

An example of performing the transformation to compress the image is shown below.

For a dataset with augmentations, however, the required number of features to achieve the same explained variance ratio increases dramatically. This is because the variation across images in the dataset has increased. This is especially true with augmentations that perform geometric transformations on the images such as a translation, crop, or rotation, causing shapes and edges in the images to shift their locations. Because PCA is dependent on the “locality” of the dataset, transformations that affect these locality relationships will also affect the performance of PCA.

A similar reason explains the differences in the performance of PCA depending on whether the subjects are shuffled prior to the train-test split. Qualitatively, the results of applying the transformation to the test images vary dramatically. An example of the PCA transformation when the data is shuffled is shown below,

while an example of the PCA transformation when the data is not shuffled is shown here.

The PCA components have been fit on the specific subjects in the training set and thus fail to capture the geometric differences in the images given new subjects in the test set.

Autoencoder

A similar phenomenon occurs when using the autoencoder, as it also poorly generalizes to subjects that are not seen in the training dataset. For example, a reconstructed image when the subjects are shuffled initially is shown below,

while an example of a reconstructed image when the data is not shuffled is shown here.

As with using PCA, the reconstructed images on the test dataset look qualitatively worse; they are much more blurry, and details relevant to the action taken such as the driver’s arm position and face orientation are difficult to perceive. Accordingly, the poor performance of the autoencoder can be attributed to the same causes that result in the poor performance of using PCA. The autoencoder has likely overfit the training data, which contains images that differ in respects such as the color, camera angle, and general body position of each subject.

K-means

The differences between shuffling and not shuffling the subjects in the dataset also affect the performance of the k-means clustering. When the subjects are shuffled, and PCA is used to reduce the number of features, k-means with k=300 achieves a testing accuracy of nearly 90%.

The confusion matrix of the model’s performance shows that it generally captures all ten labels in the test set accurately, without any noticeable bias for one particular category.

A similarly high accuracy of nearly 90% is achieved when using the autoencoder on the pre-shuffled dataset and k=450.

Again, the confusion matrix demonstrates that the model generally captures every label accurately.

Notably, however, the confusion matrices illustrate a higher degree of misclassification occurring between pairs of labels, such as labels 3 and 4 in the matrix shown above. These two labels in particular correspond to “Texting - left” and “Talking on the phone - left”, which are compared below.

subject 026, Texting - left | subject 015, Talking on the phone - left

There is a higher degree of similarity between the images in this pair of labels relative to all other labels; the subjects are sitting in similar body postures, with the only difference being the position of their left arm. This may explain some of the “paired misclassifications” found in the confusion matrix. A similar analysis can be made for the pair “Texting - right” and “Talking on the phone - right”.

Since the data is fairly evenly distributed amongst the classes, we can use any of macro, micro, or weighted averaging for the metrics and still receive similar numbers. Listed below are the accuracy as well as the F1 score, precision, and recall using a weighted average across the one-vs-rest heuristic for multi-class classification.

| | Training | Testing |
| --- | --- | --- |
| Accuracy | 0.8922 | 0.8855 |
| Precision | 0.8930 | 0.8873 |
| Recall | 0.8928 | 0.8883 |
| F1 | 0.8911 | 0.8857 |

The sum of the incorrect predictions per column corresponds to the false positives, which impact the precision. The sum of the incorrect predictions per row corresponds to the false negatives, which impact the recall. The incorrect predictions along rows and columns are fairly similar, which leads to similar precision and recall values. Since the F1 score is the harmonic mean of precision and recall, it is also very close to those values.
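
These metrics can be computed with scikit-learn as in the sketch below, where y_test and y_pred are placeholders for the ground-truth and predicted labels.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

accuracy = accuracy_score(y_test, y_pred)
# Weighted averaging across the ten classes, as in the table above.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted"
)
```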

However, the model performs far worse if the data is not shuffled, and subjects are withheld in the testing set. The resulting elbow plot for k-means when PCA was applied is shown below.

As shown by the confusion matrix, the model essentially assigns all of the images to a subset of the labels, achieving an accuracy of around 10%, which is the same performance as random guessing.

| | Training | Testing |
| --- | --- | --- |
| Accuracy | 0.8506 | 0.1091 |
| Precision | 0.8535 | 0.0697 |
| Recall | 0.8485 | 0.1166 |
| F1 | 0.8485 | 0.0521 |

The poor performance of the PCA transformation on the unshuffled dataset, as shown above, can also explain the poor performance in the k-means model. The geometric dissimilarities caused by different subjects appearing in the testing set also cause the k-means clusters to fail to classify new images with unseen subjects.

With a combination of augmentations applied to the images, the resulting performance of PCA and k-means clustering on the test set when subjects are withheld does not appear to improve. This suggests that the models are too reliant on and overfitting to specific subjects appearing in the training data and fail to generalize to new subjects.

Again, similar poor performance is seen when the autoencoder is used instead of PCA to produce reduced dimension inputs to the k-means model. In this instance, the model achieved a test accuracy of about 15%, but the confusion matrix still illustrates an inability to reliably delineate between classes.

Support Vector Machines

Linear

The linear SVM was set to solve the primal optimization problem for both shuffled and unshuffled data. Training takes around four minutes on our machine. Below is the confusion matrix, accuracy and F1 score for shuffled data:

Training data:

Train Accuracy (Linear): 100.00

Train F1 (Linear): 100.00

Test set:

Test Accuracy (Linear): 99.42

Test F1 (Linear): 99.42

While the results are very good in the case of shuffled data, the unshuffled data suffers the same performance problems as k-means.

Below is the confusion matrix, accuracy and F1 score for unshuffled data:

Training data:

Train Accuracy (Linear): 100.00

Train F1 (Linear): 100.00

Test data:

Test Accuracy (Linear): 17.43

Test F1 (Linear): 12.87

The training accuracy is still very good, but the test accuracy is only marginally better than random guessing. This indicates that the model still has problems generalizing to new subjects. We experimented with regularization by changing the C hyperparameter, but the results were similar.

3rd Degree Polynomial

Training with the polynomial kernel takes about 20 minutes. No regularization was used. Below is the confusion matrix, accuracy and F1 score for shuffled data:

Training data:

Train Accuracy (Polynomial Kernel): 99.99

Train F1 (Polynomial Kernel): 99.99

Test set:

Test Accuracy (Polynomial Kernel): 99.91

Test F1 (Polynomial Kernel): 99.91

The 3rd-degree polynomial kernel exhibits the same generalization issues as the linear SVM.

Below is the confusion matrix, accuracy and F1 score for unshuffled data:

Training data:

Train Accuracy (Polynomial Kernel): 99.99

Train F1 (Polynomial Kernel): 99.99

Test set:

Test Accuracy (Polynomial Kernel): 13.56

Test F1 (Polynomial Kernel): 7.53

Radial Basis Function (RBF)

For the test with the RBF kernel, the SVM was run with the following hyperparameters:

Gamma = 1.0

C = 0.2

Gamma defines the influence of each training sample: the higher the gamma, the smaller the region over which each sample has influence. C is the regularization parameter.
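
For reference, this configuration corresponds to the following scikit-learn call (a sketch; the rest of the pipeline is unchanged from the other kernels).

```python
from sklearn.svm import SVC

# RBF-kernel SVM with the hyperparameters listed above.
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=0.2)
```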

Training of the SVM with the RBF kernel took over an hour and the results did not look promising, so no further experimentation was done. The accuracy and F1 scores for the model with unshuffled data are shown below:

Training Accuracy: 10.96

Test Accuracy: 11.62

Training F1: 2.17

Test F1: 2.42

Convolutional Neural Networks

We first created a custom CNN architecture to classify the images.

The following are the final accuracy and loss values after training the custom CNN model:

Loss value of training (epochs vs. value):

Loss value of validation (epochs vs. value):

The following are the final accuracy and loss values after training the VGG16 model:

Loss value of training (epochs vs. value):

Loss value of validation (epochs vs. value):

Comparing the custom model and VGG16 on the shuffled dataset shown above demonstrates the greatly improved performance of modern CNN architectures. Next, ResNet-18 [7] was evaluated on the shuffled dataset, achieving a validation accuracy of 98.5%.

ResNet-18
Shuffled Subjects
LR = 5e-3 | Weight Decay = 0
Train Accuracy = 99.6% | Validation Accuracy = 98.5%

Confusion matrix for shuffled subjects

The same model was then applied to the unshuffled dataset, with the convolutional layers frozen to decrease the computational intensity of backpropagation and decrease training time.
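
A minimal PyTorch sketch of this freezing step is shown below; the use of ImageNet-pretrained weights and the Adam optimizer are assumptions, while the learning rate and weight decay match the values reported for this run.

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights="IMAGENET1K_V1")   # pretrained backbone (assumed)
for param in resnet.parameters():
    param.requires_grad = False                      # freeze the convolutional layers

# Replace the final fully connected layer for the 10-class problem (trainable).
resnet.fc = nn.Linear(resnet.fc.in_features, 10)

optimizer = torch.optim.Adam(
    (p for p in resnet.parameters() if p.requires_grad),
    lr=1e-3, weight_decay=1e-4,
)
```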

ResNet-18, Frozen Convolutional Layers
Unshuffled Subjects
LR = 1e-3 | Weight Decay = 1e-4
Train Accuracy = 76.7% | Validation Accuracy = 37.9%

The convolutional layers were then unfrozen and a validation accuracy of 71.6% was achieved.

ResNet-18
Unshuffled Subjects
LR = 5e-4 | Weight Decay = 5e-2
Train Accuracy = 99.7% | Validation Accuracy = 71.6%

ResNet-152, a version of ResNet with 152 layers instead of 18, was then tested to determine whether the additional complexity provided by the extra layers could capture more information in the data.

ResNet-152
Unshuffled Subjects
LR = 1e-3 | Weight Decay = 5e-3
Train Accuracy = 99.7% | Validation Accuracy = 74.0%

Confusion matrix for unshuffled subjects

EfficientNet-B4, an architecture with state-of-the-art performance on image classification benchmarks such as ImageNet with only 19 million parameters [8], was chosen next, achieving a validation accuracy of 77.3%.

EfficientNet-B4
Unshuffled Subjects
LR = 1e-2 | Weight Decay = 2.5e-3
Train Accuracy = 98.9% | Validation Accuracy = 77.3%

We then tested whether a convolutional network is able to solve a simpler problem: binary classification of safe and unsafe driving. Before training a ResNet-18 network on this problem, we kept all data from the safe driving class and downsampled each of the other nine classes to roughly 1/9 of its size to keep the data balanced.
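
A short sketch of this downsampling is shown below; `labels` is a placeholder for the array of original 10-class labels, with class 0 taken to be safe driving.

```python
import numpy as np

rng = np.random.default_rng(0)

keep = np.zeros(len(labels), dtype=bool)
keep[labels == 0] = True                         # keep every safe-driving image
for c in range(1, 10):
    idx = np.where(labels == c)[0]
    # Keep roughly 1/9 of each unsafe class so the two binary classes stay balanced.
    keep[rng.choice(idx, size=len(idx) // 9, replace=False)] = True

binary_labels = (labels[keep] != 0).astype(int)  # 0 = safe driving, 1 = distracted
```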

ResNet-18
Binary Classification
Unshuffled Subjects
LR = 5e-4 | Weight Decay = 5e-2
Train Accuracy = 97.4% | Validation Accuracy = 63.6%

Although the overall accuracy of the binary image classifier is poor (64%), the rate at which unsafe driving is correctly detected is very high (96%) and the rate at which it is missed is very low (4%), meaning that if a driver is driving unsafely, the model is very likely to detect it. So, the model may qualify as acceptable for driver safety applications, although users may be inconvenienced by the high false-positive rate on safe driving.

Conclusion

This project aimed to develop machine learning models to identify distracted driving behavior. Such a tool could be used as part of a system that monitors drivers and preemptively warns them if they are being distracted, keeping their focus on the road and improving their safety. The dataset itself consisted of still images of a number of subjects performing one of ten actions while driving. One category included drivers focused on the road, while the other nine were considered distracted driving behaviors. A number of unsupervised and supervised methods were tested in an attempt to classify each image against one of the ten labels. Notably, splitting the dataset into training and testing involved a decision on whether or not to shuffle it beforehand. If it were shuffled, then the training and testing set would contain a mix of all of the subjects present in the dataset. Otherwise, the training and testing set would contain distinct subjects. Testing the models’ performance on this unshuffled dataset would be more informative, because it would demonstrate the model’s ability to generalize well for unknown subjects. Realistically, this would be required of a driver monitoring system, which would be trained on an existing dataset and then used on a newer, potentially more varied population in commercial use.

Overall, the CNN model performed the best, achieving a test accuracy of around 77%. All other models saw much worse performance, with no other model reaching an accuracy above 20%. When subjects are withheld from the training data, the overall poor performance of many of the methods suggests that the models are overfitting and failing to generalize to new subjects. This reasoning is supported by the fact that the dataset itself only has fewer than 70 unique subjects, so new subjects in the testing set likely introduce new shapes and features that would cause the model to perform poorly.

Similar results were observed with the supervised methods, with both the SVMs and CNNs achieving around 99% validation accuracy on the shuffled dataset but performing significantly worse on the unshuffled dataset. This can again be attributed to the limited number of unique subjects available for training. Although large CNN models such as ResNet-152 were certainly able to capture the complexity of the dataset, they were not able to generalize to a satisfactory level. Similar observations can be made about binary driver safety classification, although it is not known whether this shortcoming is a result of having less training data than the original 10-class classification problem.

One main area of future work involves developing methods that allow the models to generalize well to unseen subjects. Addressing this issue may involve extracting subject-agnostic features from the images prior to performing clustering or classification, for example via unsupervised segmentation or a pre-trained neural network performing semantic segmentation. Additionally, expanding the dataset to include a wider breadth of subjects may improve performance. This would also mitigate any potential biases against specific populations that could arise from a limited dataset such as the one used in this project.

One other avenue for future work involves improving the types of machine learning models used. Because behaviors involve a temporal component, models that can reason temporally using a series of images or video data may be more suitable to this particular application. Other methods may incorporate multiple modes of data, such as incorporating data from the driver’s phone.

References

[1] S. Coleman, “Distracted driving statistics 2022,” Bankrate, 07-Sep-2021. [Online]. Available: https://www.bankrate.com/insurance/car/distracted-driving-statistics/. [Accessed: 24-Feb-2022].

[2] J. Wang, Z. Wu, F. Li, and J. Zhang, “A Data Augmentation Approach to Distracted Driving Detection,” Future Internet, vol. 13, no. 1, p. 1, Dec. 2020, doi: 10.3390/fi13010001.

[3] X. Rao, F. Lin, Z. Chen et al., “Distracted driving recognition method based on deep convolutional neural network,” J Ambient Intell Human Comput, vol. 12, pp. 193–200, 2021, doi: 10.1007/s12652-019-01597-4.

[4] State Farm Distracted Driver Detection, Kaggle, 2016. Accessed on: Feb. 24, 2022. [Online]. Available: https://www.kaggle.com/c/state-farm-distracted-driver-detection/overview

[5] D. Feng and Y. Yue, “Machine Learning Techniques for Distracted Driver Detection,” CS 229: Machine Learning, 2019. [Online]. Available: http://cs229.stanford.edu/proj2019spr/report/24.pdf. [Accessed: 23-Feb-2022].

[6] M. H. Alkinani, W. Z. Khan and Q. Arshad, “Detecting Human Driver Inattentive and Aggressive Driving Behavior Using Deep Learning: Recent Advances, Requirements and Open Challenges,” in IEEE Access, vol. 8, pp. 105008-105030, 2020, doi: 10.1109/ACCESS.2020.2999829.

[7] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.

[8] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” 2019. [Online]. Available: https://arxiv.org/abs/1905.11946