inzva AI Projects #5 - Speaker Identification
In this project, we tackled the problem of Speaker Identification, i.e., recognizing a person from a voice utterance. We implemented the methods proposed in the Deep CNNs With Self-Attention for Speaker Identification paper in both TensorFlow-Keras and PyTorch.
We used the following datasets:
- VCTK: no license agreement is required, and the dataset is easy to use after download.
- VoxCeleb: we recommend signing up on its website to obtain the download and conversion scripts. The identification data split text file will also be required.
The files under dataloaders load the data, using data generators in Keras and DataLoaders in PyTorch. The scripts can either generate file paths at runtime or read them directly from a txt file; we recommend generating txt files. Check this notebook to generate such a file.
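For reference, here is a minimal sketch of generating such a txt file from a speaker-per-folder layout; the directory structure and line format are assumptions, so adapt them to what the loaders in this repo expect.

```python
# Sketch: write a "path<TAB>label" txt file from a speaker-per-folder layout.
# The directory structure and line format are assumptions; adapt them to your
# dataset and to what the dataloaders in this repo expect.
import os

def write_file_list(data_root, out_txt):
    speakers = sorted(d for d in os.listdir(data_root)
                      if os.path.isdir(os.path.join(data_root, d)))
    with open(out_txt, "w") as f:
        for label, speaker in enumerate(speakers):
            speaker_dir = os.path.join(data_root, speaker)
            for fname in sorted(os.listdir(speaker_dir)):
                if fname.endswith(".wav"):
                    f.write(f"{os.path.join(speaker_dir, fname)}\t{label}\n")

# Example (hypothetical paths):
# write_file_list("VCTK-Corpus/wav48", "vctk_files.txt")
```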
We also recommend extracting the audio features once, saving them as pickle files, and loading those instead; our data loaders support this workflow as well. Check out the scripts under the utils folder to create such files.
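A minimal sketch of this kind of feature caching with pickle (the file layout and dictionary keys are illustrative; the actual scripts live under utils):

```python
# Sketch: cache extracted features as a pickle file and reload them later.
# The file layout and dictionary keys are illustrative; the actual scripts
# live under the utils folder.
import pickle

def save_features(features, labels, pkl_path):
    with open(pkl_path, "wb") as f:
        pickle.dump({"features": features, "labels": labels}, f)

def load_features(pkl_path):
    with open(pkl_path, "rb") as f:
        data = pickle.load(f)
    return data["features"], data["labels"]
```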
Before feeding the audio files into our models, we extract filter bank coefficients from them. See here for the complete process; our implementation is in utils/preprocessed_feature_extraction.py.
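As a rough sketch of what filter bank extraction looks like with librosa (the exact parameters and normalization in utils/preprocessed_feature_extraction.py may differ):

```python
# Sketch: log-mel filter bank extraction with librosa. The frame and mel
# parameters below are common choices, not necessarily those used in
# utils/preprocessed_feature_extraction.py.
import librosa
import numpy as np

def extract_fbanks(wav_path, sr=16000, n_mels=64):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)  # 25 ms / 10 ms frames
    log_mel = librosa.power_to_db(mel)
    # Per-utterance mean/variance normalization (one common convention).
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
```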
We implemented the following architectures:
We achieved the following results:
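Both implementations follow the paper's overall design: a deep CNN feature extractor followed by self-attention pooling over time frames. As a rough illustration, here is a minimal PyTorch sketch of one common self-attentive pooling formulation (the exact attention layer in this repo may differ):

```python
# Sketch: self-attentive pooling over frame-level features (PyTorch). This is
# one common formulation; the exact attention mechanism in the repo may differ.
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    def __init__(self, feat_dim, attn_dim=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, x):
        # x: (batch, time, feat_dim) frame-level features from the CNN.
        weights = torch.softmax(self.attention(x), dim=1)  # (batch, time, 1)
        return (weights * x).sum(dim=1)                    # (batch, feat_dim)
```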
After training our models, we extracted embeddings with the trained model and used the k-nearest-neighbors (kNN) algorithm to find the closest neighbors of the extracted embeddings. Such a system can be used to find the closest voice utterances, and their class labels, for a given audio signal.
Check out the extract_embeds.py and closest_celeb.py scripts for the implementation of this method.
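To illustrate the idea, a minimal cosine-similarity kNN over stored embeddings in NumPy (the real pipeline is in those scripts; the function and variable names here are hypothetical):

```python
# Sketch: cosine-similarity kNN over stored embeddings with NumPy. The real
# pipeline lives in extract_embeds.py and closest_celeb.py; names here are
# hypothetical.
import numpy as np

def closest_speakers(query_embed, gallery_embeds, gallery_labels, k=5):
    # Normalize so that the dot product equals cosine similarity.
    q = query_embed / np.linalg.norm(query_embed)
    g = gallery_embeds / np.linalg.norm(gallery_embeds, axis=1, keepdims=True)
    sims = g @ q                      # similarity of the query to every gallery item
    top_k = np.argsort(-sims)[:k]     # indices of the k most similar embeddings
    return [(gallery_labels[i], float(sims[i])) for i in top_k]
```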
- Keras
- PyTorch
- Matplotlib
- TensorFlow
- Pickle
- NumPy
- Librosa