
A machine learning approach to building a robust speaker recognition model using MFCC features and a GMM universal background model.


Speaker Recognition

This is an extension of the work done by Atul-Anand-Jha on the implementation of MFCC (Mel-Frequency Cepstral Coefficients) and GMM (Gaussian Mixture Model), which can be accessed from the following link:

1. Algorithmic Details:

For feature extraction, MFCCs (Mel-Frequency Cepstral Coefficients) are used, which emphasize extracting the low-frequency components and their cepstral coefficients from the audio files. The procedure for feature extraction is described in the figure below:

a. Feature Extraction

Figure: MFCC feature extraction procedure

Basically, the audio is split into short frames using a windowing technique, and each frame is converted into a frequency-domain representation using the Discrete Fourier Transform. The spectrum of each frame is then mapped onto the mel scale with a filter bank, the filter-bank energies are converted to a logarithmic scale, and finally the Discrete Cosine Transform is applied to obtain the mel-frequency cepstral coefficients. The overall process of calculating the MFCC features is thus carried out in the frequency domain, as sketched below.
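
This pipeline can be sketched in a few lines. The sketch below is a minimal illustration assuming the python_speech_features and scipy libraries; the 25 ms window, 10 ms step, and 13 coefficients are common defaults, not values taken from this repository.

```python
# A minimal MFCC-extraction sketch, assuming python_speech_features and scipy;
# the window/step sizes and coefficient count are illustrative defaults.
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc, delta

def extract_mfcc_features(wav_path):
    rate, signal = wavfile.read(wav_path)   # load the .wav file
    if signal.ndim == 2:                    # stereo (2-channel) -> mono
        signal = signal.mean(axis=1)
    # 25 ms frames with a 10 ms step; 13 cepstral coefficients per frame
    feats = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01, numcep=13)
    deltas = delta(feats, 2)                # first-order deltas, often appended
    return np.hstack((feats, deltas))       # shape: (n_frames, 26)
```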

b. Model Representation using a Gaussian Mixture Model

To represent the model of each speaker, a GMM, i.e. Gaussian Mixture Model, is used. Basically, this technique relies on generalizing the Gaussians that arise from the features extracted from the audio files of a particular speaker during the training phase.

Figure: Gaussian Mixture Model generalizing the individual Gaussians present in the feature array

The dotted lines above can be interpreted as the features present in each speaker's audio file, while the solid line can be interpreted as the generalized Gaussian present in the feature space.
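
As a rough illustration of this step, the sketch below fits a per-speaker GMM over the extracted MFCC frames, assuming scikit-learn; the component count and covariance type are assumptions, not values taken from this repository.

```python
# A minimal per-speaker GMM training sketch, assuming scikit-learn;
# n_components=16 and diagonal covariances are illustrative choices.
from sklearn.mixture import GaussianMixture

def train_speaker_model(features):
    # features: (n_frames, n_coeffs) MFCC array from one speaker's audio
    gmm = GaussianMixture(n_components=16, covariance_type='diag',
                          max_iter=200, n_init=3)
    gmm.fit(features)   # EM generalizes the per-frame Gaussians into a mixture
    return gmm
```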

2. Usage

Python version: 3.7 or above. The required libraries can be found in the requirements.txt file included in the repository.

pip install -r requirements.txt

a. Training on the speakers' audio files

Audio files supported:
i. Audio file type: .wav
ii. Channels: 2 (stereo)

For training and creating a speaker's model, one needs to place the individual speaker's audio files inside the dataset/train folder, under the name of the speaker. Browse the training folder's list for reference.

Then, to generate the GMM model of the corresponding speaker, simply use:

python train.py <speaker's file name>

The GMM model of each speaker is dumped into the Speaker's Model folder using pickle.
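
A minimal persistence sketch, assuming the Speaker's Model folder mentioned above; the .gmm file naming is hypothetical:

```python
# Dump a trained GMM with pickle; the ".gmm" extension is a hypothetical choice.
import os
import pickle

def save_speaker_model(gmm, speaker_name, model_dir="Speaker's Model"):
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, speaker_name + ".gmm"), "wb") as f:
        pickle.dump(gmm, f)
```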

Tip: the more the data, the better the accuracy ;)

b. Predicting

For predicting, simply record your sound and keep it inside the predict folder present inside the dataset folder. Then use the following command to predict the file:

python predict.py "filename.mp3"
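
Under the hood, prediction amounts to scoring the test file's MFCC frames against every stored speaker model and picking the best average log-likelihood. A minimal sketch, reusing the hypothetical extract_mfcc_features helper and pickle layout from the sketches above:

```python
# Score the test file against every stored GMM and return the best match.
import glob
import os
import pickle

def predict_speaker(audio_path, model_dir="Speaker's Model"):
    feats = extract_mfcc_features(audio_path)      # MFCC sketch shown earlier
    best_speaker, best_score = None, float("-inf")
    for path in glob.glob(os.path.join(model_dir, "*.gmm")):
        with open(path, "rb") as f:
            gmm = pickle.load(f)
        score = gmm.score(feats)                   # mean log-likelihood per frame
        if score > best_score:
            best_speaker = os.path.splitext(os.path.basename(path))[0]
            best_score = score
    return best_speaker
```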

3. Accuracy

For accuracy testing, one needs to provide each speaker's audio files in the test folder. Those audio files must be placed under each individual speaker's name.

Caution: make sure that the name of each speaker's folder in the testing phase is the same as in the training phase.

Then simply use the command below to test the accuracy of the model:

python accuracy_test.py

4. Accuracy Measured

For accuracy testing, the precision, recall, and F-score of each speaker's model were measured using a confusion matrix. An average of 10 minutes of audio per speaker was kept in the training folder, while 9 different audio files per speaker were used.
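
For reference, these metrics can be derived from the predicted and true speaker labels; a minimal sketch assuming scikit-learn, where y_true and y_pred are hypothetical per-file label lists:

```python
# Compute the confusion matrix and per-speaker precision/recall/F-score,
# assuming scikit-learn; labels is the list of speaker names.
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def evaluate(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    precision, recall, fscore, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0)
    return cm, precision, recall, fscore
```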

a. In case of non-noisy data:

Figure: confusion matrix plot (non-noisy data)

Figure: accuracy thus measured (non-noisy data)

b. In case of noisy data:

Figure: confusion matrix plot (noisy data)

Figure: accuracy thus measured (noisy data)