
Application that recognizes American Sign Language (ASL) alphabet signs using a deep neural network trained with a transfer learning approach.


American Sign Language Recognition using Deep Neural Network (Transfer Learning Approach)

1 - Introduction

American Sign Language (ASL) is a complete, natural language expressed through movements of the hands and face. ASL gives the deaf community a way to communicate both within the community itself and with the outside world. However, not everyone knows the signs and gestures used in sign language. With the advent of artificial neural networks and deep learning, it is now possible to build systems that can recognize objects, and even distinguish fine-grained categories (such as a red apple from a green one). Building on this, the application here uses a deep learning model trained on an ASL dataset to predict the sign shown in an input image or video frame. You can learn more about American Sign Language on the National Institute on Deafness and Other Communication Disorders (NIDCD) website.

Alphabet signs in American Sign Language are shown below:

American Sign Language - Signs

2 - Approach

We use Transfer Learning together with Data Augmentation to build a deep learning model for the ASL dataset.

2.1 - Dataset

The network was trained on this Kaggle ASL Alphabet dataset. The dataset contains 87,000 images of 200x200 pixels, divided into 29 classes (the 26 English letters plus 3 additional classes: SPACE, DELETE, and NOTHING).

2.2 - Data Augmentation

To prepare the model for more realistic real-world conditions, we augment the data with a brightness shift (up to 20% darker lighting conditions) and a zoom shift (zooming out up to 120%).
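
A minimal sketch of how this augmentation could be set up with Keras' ImageDataGenerator is shown below. The directory name asl_alphabet_train/, the batch size, and the validation split are illustrative assumptions, not values taken from the notebook.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Brightness between 80% and 100% of the original (up to 20% darker) and
# zoom factors between 1.0 and 1.2 (i.e. zooming out up to 120%).
datagen = ImageDataGenerator(
    brightness_range=[0.8, 1.0],
    zoom_range=[1.0, 1.2],
    validation_split=0.1,  # assumption: hold out 10% of the images for validation
)

train_gen = datagen.flow_from_directory(
    'asl_alphabet_train/',        # hypothetical path to the unpacked Kaggle dataset
    target_size=(200, 200),
    batch_size=64,                # assumption
    class_mode='categorical',
    subset='training',
)

val_gen = datagen.flow_from_directory(
    'asl_alphabet_train/',
    target_size=(200, 200),
    batch_size=64,
    class_mode='categorical',
    subset='validation',
)
```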

2.3 - Transfer Learning (Inception v3 as base model)

The network uses Google's Inception v3 as the base model. The first 248 (out of 311) layers of the model (i.e., up to the third-to-last inception block) are frozen, leaving only the last 2 inception blocks trainable, and the fully connected layers at the top of the Inception network are removed. We then add our own set of fully connected layers on top of the Inception network to adapt it to our application: 2 fully connected layers, one with 1024 ReLU units and the other with 29 softmax units for the prediction of the 29 classes. The model is then trained on the new set of ASL images.
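
A rough Keras sketch of this architecture is given below. The GlobalAveragePooling2D layer, the 200x200 input size, and the exact freezing index are assumptions inferred from the description above, not code copied from the notebook.

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Inception v3 without its top fully connected layers.
base_model = InceptionV3(weights='imagenet', include_top=False,
                         input_shape=(200, 200, 3))

# Freeze the first 248 layers so that only the last two inception blocks train.
for layer in base_model.layers[:248]:
    layer.trainable = False

# New classification head: 1024 ReLU units followed by 29 softmax outputs.
x = GlobalAveragePooling2D()(base_model.output)
x = Dense(1024, activation='relu')(x)
predictions = Dense(29, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=predictions)
```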

2.4 - Using the model for the application

After the model is trained, it is loaded into the application. OpenCV is used to capture frames from a video feed. The application provides an area (inside a green rectangle) where the signs are to be presented for recognition. Each captured frame is preprocessed and fed to the model, which predicts the sign being shown. If the model predicts a sign with confidence greater than 20%, the prediction is presented to the user: predictions with 20% to 50% confidence are treated as LOW confidence and shown as Maybe [sign] - [confidence], while predictions above 50% confidence are treated as HIGH confidence and shown as [sign] - [confidence], where [sign] is the predicted sign and [confidence] is the model's confidence for that prediction. Otherwise, the application displays nothing.
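
The capture-and-predict loop can be sketched roughly as follows. The model path, class ordering, region-of-interest coordinates, and preprocessing (resize plus scaling to [0, 1]) are assumptions for illustration; they are not the exact code from asl_alphabet_application.py.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model('asl_model.h5')  # hypothetical path to the trained model
# Assumed class ordering: A-Z followed by del, nothing, space (alphabetical folder order).
class_names = [chr(c) for c in range(ord('A'), ord('Z') + 1)] + ['del', 'nothing', 'space']

cap = cv2.VideoCapture(0)
x0, y0, x1, y1 = 100, 100, 300, 300  # assumed coordinates of the green rectangle

while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.rectangle(frame, (x0, y0), (x1, y1), (0, 255, 0), 2)

    # Crop the region of interest, resize to the model input and scale to [0, 1].
    roi = cv2.resize(frame[y0:y1, x0:x1], (200, 200)).astype('float32') / 255.0
    probs = model.predict(np.expand_dims(roi, axis=0))[0]
    idx, conf = int(np.argmax(probs)), float(np.max(probs))

    # 20-50% confidence -> "Maybe <sign>", above 50% -> "<sign>", otherwise nothing.
    if conf > 0.5:
        text = '{} - {:.0%}'.format(class_names[idx], conf)
    elif conf > 0.2:
        text = 'Maybe {} - {:.0%}'.format(class_names[idx], conf)
    else:
        text = ''
    cv2.putText(frame, text, (x0, y1 + 30), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)

    cv2.imshow('ASL Recognition', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```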

Note: You can download the notebook (American_Sign_Language_Recognition.ipynb) or the PDF version of the notebook (American Sign Language Recognition.ipynb - Colaboratory.pdf) to have a better understanding of the implementation.

3 - Results

For training, Categorical Crossentropy was used as the loss function along with a Stochastic Gradient Descent optimizer (learning rate 0.0001, momentum 0.9), and the model was trained for 24 epochs. The results are displayed in the sections below.
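
Assuming the model and data generators from the earlier sketches (the names model, train_gen, and val_gen are assumptions), this training setup would look roughly like the following:

```python
from tensorflow.keras.optimizers import SGD

model.compile(
    optimizer=SGD(learning_rate=0.0001, momentum=0.9),
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)

# fit_generator is used here because the requirements list TensorFlow 1.15;
# on TensorFlow 2.x, model.fit accepts the generators directly.
history = model.fit_generator(
    train_gen,
    validation_data=val_gen,
    epochs=24,
)
```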

3.1 - Tabular Results

Metric                 Value
Training Accuracy      0.9887 (~98.87%)
Training Loss          0.1100
Validation Accuracy    0.9575 (~95.75%)
Validation Loss        0.1926
Test Accuracy          0.9643 (~96.43%)

3.2 - Graphical Results

Training vs Validation Accuracy | Training vs Validation Loss

4 - Running the application

If you want to try out the application, you will need to satisfy a few requirements to run it on your PC.

4.1 - Requirements

  • Python v3.7.4 or higher (should work with v3.5.2 and above as well)
  • NumPy
  • OpenCV v3 or higher
  • TensorFlow v1.15.0-rc3 (may work with higher versions) [GPU version preferred]
  • Might require a PC with an NVIDIA GPU (at least 2 GB of graphics memory)
  • Webcam

4.2 - Clone this repository

  • Clone this repository using git clone https://github.com/LeonidAlekseev/American-Sign-Language-Recognition-using-Deep-Learning.

4.3 - Executing the script

  1. Open a command prompt inside the cloned repository folder or just open a command prompt and navigate to the cloned directory.
  2. Execute this command: python asl_alphabet_application.py.
  3. An application window like the one shown below should appear after a few seconds or minutes (depending on the PC): Application Preview
  4. Present the signs inside the green rectangular area provided by the application.
  5. Watch the predictions along with the confidence score below the green rectangle.

5 - Outputs

Here are some of the outputs generated by the application in various lighting conditions: Preview 1 Preview 2 Preview 3