TensorFlow Speech Recognition Challenge

Aim

The aim of this project is to detect and classify simple spoken commands in one-second audio clips. The models are trained on a labeled training set and evaluated on an unlabeled test set.

Dataset

The dataset used is the Speech Commands Dataset, which was released by TensorFlow. It includes 65,000 one-second utterances of 30 short words, spoken by thousands of different people. In this challenge, however, each audio clip has to be classified into one of 12 classes: yes, no, up, down, left, right, on, off, stop, go, silence, unknown. The unknown label is used for any command that is neither one of the first 10 labels nor silence.
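The mapping from the 30 recorded words to the 12 challenge classes can be sketched as below. This is a minimal illustration rather than the project's actual code: the names `TARGET_WORDS`, `CLASSES` and `to_challenge_label` are hypothetical, and treating a `_background_noise_` folder as the source of silence clips is an assumption about how the dataset is laid out on disk.

```python
# The 10 command words that keep their own label in the challenge.
TARGET_WORDS = ("yes", "no", "up", "down", "left", "right",
                "on", "off", "stop", "go")
SILENCE = "silence"
UNKNOWN = "unknown"

# Full set of 12 output classes.
CLASSES = TARGET_WORDS + (SILENCE, UNKNOWN)


def to_challenge_label(word):
    """Map a recorded word (or folder name) to one of the 12 classes."""
    word = word.strip().lower()
    if word in TARGET_WORDS:
        return word
    # Assumption: silence clips come from a "_background_noise_" folder.
    if word in (SILENCE, "_background_noise_"):
        return SILENCE
    # Every other spoken word is collapsed into "unknown".
    return UNKNOWN
```

For example, `to_challenge_label("stop")` keeps its label, while a word outside the 10 targets, such as "bed", is mapped to unknown.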

Implementation

I implemented 3 neural network architectures:

Combination of RNN LSTM nodes and CNN,
CNN with residual blocks similar to ResNet,
Deep RNN LSTM network;

I compared how well each of these architectures classifies the 12 speech commands. The audio data is first converted into spectrogram images, which are then augmented and normalized. The three architectures achieved test accuracies of 74%, 76% and 71%, respectively.
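The preprocessing order described above (spectrogram, then augmentation, then normalization) could look roughly like the following NumPy sketch. The README does not state the sample rate, window sizes, augmentation type or normalization scheme, so everything here is an assumption for illustration: a 16 kHz sample rate, 25 ms frames with a 10 ms hop, a random shift along the time axis as the augmentation, and zero-mean, unit-variance scaling per clip. All function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed: samples in a one-second clip


def pad_or_trim(wave, length=SAMPLE_RATE):
    """Force a clip to exactly `length` samples by trimming or zero-padding."""
    wave = np.asarray(wave, dtype=np.float32)
    if len(wave) >= length:
        return wave[:length]
    return np.pad(wave, (0, length - len(wave)))


def log_spectrogram(wave, frame_len=400, hop=160, eps=1e-10):
    """Log power spectrogram with shape (time frames, frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)  # eps avoids log(0) on silent frames


def time_shift(spec, max_shift, rng):
    """Assumed augmentation: shift the spectrogram along the time axis.

    np.roll wraps frames around from one end to the other.
    """
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(spec, shift, axis=0)


def normalize(spec):
    """Scale one spectrogram to zero mean and unit variance."""
    return (spec - spec.mean()) / (spec.std() + 1e-8)


def preprocess(wave, rng, max_shift=10):
    spec = log_spectrogram(pad_or_trim(wave))
    return normalize(time_shift(spec, max_shift, rng))
```

With these assumed settings, a one-second clip yields a 98 x 201 array (98 time frames, 201 frequency bins), which can be fed to a CNN as a single-channel image or to an LSTM as a sequence of 98 feature vectors.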

RNN: Recurrent Neural Network

LSTM: Long Short-Term Memory

CNN: Convolutional Neural Network

sdhayalk/TensorFlow_Speech_Recognition_Challenge
