Automatic Speech Recognition

Table of Contents

  • Dataset Overview
  • Data Visualization
  • Methodology Used
  • Observations
  • Deductions

The ASR model is implemented using machine learning and natural language processing techniques.

  • The dataset used for this task is sourced from Google.
  • The dataset is composed of short (one-second or less) audio clips of commands, such as "down", "go", "left", "no", "right", "stop", "up" and "yes".
  • Each of the 8 commands has 1,000 different voice samples.
  • Each WAV file contains time-series data with a fixed number of samples per second.
  • Each sample represents the amplitude of the audio signal at that specific point in time.
  • The full dataset consists of 8,000 samples, split into 80% for training and 10% each for validation and testing.
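The report does not show the split code, but the 80/10/10 split described above could be sketched as follows, assuming 8,000 clips indexed 0-7999 (the seed and variable names are illustrative):

```python
import numpy as np

# Hypothetical sketch of the 80/10/10 split, assuming one label per clip
# and 8,000 clips in total.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(8000)          # shuffle clip indices

n_train = int(0.8 * len(indices))        # 6,400 training clips
n_val = int(0.1 * len(indices))          # 800 validation clips

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]     # remaining 800 test clips
```

Shuffling before slicing ensures each split draws roughly evenly from all 8 commands.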

Methodology Used

We used a convolutional neural network (CNN) model. A CNN exploits the correlations that exist within the input data: each successive layer of the network connects to a local region of the previous layer's neurons.

  • The raw data is in .wav form.
  • We transform these waveforms from time-domain signals into frequency-domain signals by computing the short-time Fourier transform (STFT), converting each waveform into a spectrogram that shows how the frequency content changes over time.
  • A spectrogram is a visual representation of the strength (loudness) of a signal over time at various frequencies.
  • Spectrograms can be represented as 2D images.
  • These spectrogram images are the inputs fed to the model.
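The STFT step above can be sketched with NumPy alone; the frame length and step values here are illustrative assumptions, not the report's actual settings:

```python
import numpy as np

def spectrogram(waveform, frame_length=256, frame_step=128):
    """Convert a 1-D waveform into a magnitude spectrogram via the STFT.

    A Hann window is applied to each frame before the FFT; only the
    non-negative frequency bins are kept (rfft).
    """
    window = np.hanning(frame_length)
    n_frames = 1 + (len(waveform) - frame_length) // frame_step
    frames = np.stack([
        waveform[i * frame_step : i * frame_step + frame_length]
        for i in range(n_frames)
    ])
    # Magnitude of the FFT of each windowed frame: shape (time, frequency)
    return np.abs(np.fft.rfft(frames * window, axis=1))

# Example: a 1-second, 16 kHz sine wave at 440 Hz
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(wave)   # 2D array; can be rendered as an image
```

The resulting 2D array is what gets treated as an image by the CNN: one axis is time, the other is frequency.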

Observations

The following can be observed from the spectrogram plots of the various command waveforms:

  • There is some degree of overlap across different categories of commands.
  • For commands like "no" and "stop", both the spectrogram and the mel-scale spectrogram look quite similar.
  • The sample rate for this dataset is 16 kHz (clips shorter than one second are zero-padded).
  • The WAV files are stored as 32-bit floating-point samples, so the amplitude values range from -1.0 to +1.0.
  • The presence of noise makes this a difficult classification challenge.
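The zero-padding mentioned above ensures every clip has the same length before the STFT. A minimal sketch (the helper name is hypothetical):

```python
import numpy as np

SAMPLE_RATE = 16000  # 16 kHz, as noted above

def pad_to_one_second(waveform, target_len=SAMPLE_RATE):
    """Zero-pad clips shorter than one second so all inputs share one shape."""
    if len(waveform) >= target_len:
        return waveform[:target_len]
    padding = np.zeros(target_len - len(waveform), dtype=waveform.dtype)
    return np.concatenate([waveform, padding])

short_clip = np.ones(12000, dtype=np.float32)   # 0.75 s of dummy audio
padded = pad_to_one_second(short_clip)          # now exactly 16,000 samples
```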

Confusion matrix showing how well the model classified each of the commands in the test set:
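Since the matrix itself cannot be reproduced here, a minimal sketch of how such a confusion matrix can be built (labels, data, and helper name are illustrative, not the report's results):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=8):
    """Count (true, predicted) pairs: rows = true labels, cols = predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with integer class labels 0-7 for the 8 commands
y_true = np.array([0, 1, 2, 2, 3])
y_pred = np.array([0, 1, 2, 3, 3])
cm = confusion_matrix(y_true, y_pred)
```

Correct predictions accumulate on the diagonal; off-diagonal entries show which commands are confused with which.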

Deductions

Some deductions made from the CNN model used are:

  • Precision score of the model: 0.8125
  • Recall score of the model: 0.8125
  • F1 score of the model: 0.812
  • With only a few CNN layers, the model can extract a limited set of features; a deeper network is needed to extract richer features.
  • Overfitting degrades the model's performance.
  • The small size of the dataset also limits how well the CNN can perform.
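The precision, recall, and F1 figures above can be derived from a confusion matrix. A macro-averaged sketch (the 2x2 matrix below is a toy example, not the report's data):

```python
import numpy as np

def macro_metrics(cm):
    """Macro-averaged precision, recall, and F1 from a confusion matrix
    (rows = true labels, columns = predictions)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # per class: TP / predicted positives
    recall = tp / cm.sum(axis=1)      # per class: TP / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision.mean(), recall.mean(), f1.mean()

# Tiny symmetric 2-class example
cm = [[40, 10],
      [10, 40]]
p, r, f = macro_metrics(cm)
```

That precision and recall are identical in the report (both 0.8125) is consistent with such an averaged computation over a fairly balanced test set.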