The aim of this project is to detect and classify simple spoken commands in one-second audio clips by training on a labeled training set and evaluating on an unlabeled test set.
The dataset used is the Speech Commands Dataset, which was released by TensorFlow. It includes 65,000 one-second utterances of 30 short words, spoken by thousands of different people. In this challenge, however, each audio clip had to be classified into one of 12 classes: `yes`, `no`, `up`, `down`, `left`, `right`, `on`, `off`, `stop`, `go`, `silence`, and `unknown`. Note that the `unknown` label is used for any command that is neither one of the first 10 labels nor `silence`.
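The labeling rule above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the names `TARGET_WORDS`, `CLASS_LABELS`, `LABEL_TO_INDEX`, and `to_class_label` are hypothetical, and `cat` is used only as an assumed example of a word outside the ten target commands.

```python
# The ten target command words listed above.
TARGET_WORDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

# All 12 class labels: the ten commands plus "silence" and "unknown".
CLASS_LABELS = TARGET_WORDS + ["silence", "unknown"]

# Integer index for each class, e.g. for use as a training target.
LABEL_TO_INDEX = {label: index for index, label in enumerate(CLASS_LABELS)}


def to_class_label(word):
    """Map a raw dataset word to one of the 12 class labels (hypothetical helper)."""
    if word in TARGET_WORDS or word == "silence":
        return word
    # Any other spoken word collapses into the catch-all class.
    return "unknown"


if __name__ == "__main__":
    for word in ["yes", "stop", "silence", "cat"]:
        label = to_class_label(word)
        print(word, "->", label, LABEL_TO_INDEX[label])
```

Collapsing the remaining words into a single `unknown` class keeps the output layer at 12 units regardless of how many distinct words the dataset contains.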
I implemented 3 neural network architectures:
- A combination of RNN LSTM nodes and a CNN
- A CNN with residual blocks, similar to ResNet
- A deep RNN LSTM network
I compared the performance of these architectures at detecting the 12 speech commands. The audio data is preprocessed into spectrogram images, followed by data augmentation and normalization. The three architectures achieved test accuracies of 74%, 76%, and 71%, respectively.
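The preprocessing steps can be sketched roughly as below, using NumPy. Everything specific here is an assumption rather than the project's actual pipeline: the 16 kHz sample rate, the 25 ms frame and 10 ms hop, the Hann window, the random time shift as the augmentation, and the function names `log_spectrogram`, `normalize`, and `random_time_shift`.

```python
import numpy as np


def random_time_shift(samples, max_shift, rng):
    """Augment a clip by shifting it in time, padding the gap with zeros (assumed augmentation)."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(samples, shift)
    if shift > 0:
        shifted[:shift] = 0.0   # zero the samples that wrapped around to the start
    elif shift < 0:
        shifted[shift:] = 0.0   # zero the samples that wrapped around to the end
    return shifted


def log_spectrogram(samples, frame_len=400, hop=160, eps=1e-10):
    """Return a (frames, frequency bins) log-magnitude spectrogram of a 1-D signal."""
    samples = np.asarray(samples, dtype=np.float64)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack(
        [samples[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + eps)  # eps avoids log(0) on silent frames


def normalize(spec):
    """Scale a spectrogram to zero mean and unit variance."""
    return (spec - spec.mean()) / (spec.std() + 1e-8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample_rate = 16000  # assumed sample rate; one second of audio
    audio = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
    spec = normalize(log_spectrogram(random_time_shift(audio, 1600, rng)))
    print(spec.shape)  # (98, 201) with the default frame and hop sizes
```

The resulting two-dimensional array can be treated as a single-channel image by a CNN, or as a sequence of per-frame feature vectors by an LSTM.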
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
CNN: Convolutional Neural Network