1D-Triplet-CNN

PyTorch implementation of the 1D-Triplet-CNN neural network model described in "Fusing MFCC and LPC Features using 1D Triplet CNN for Speaker Recognition in Severely Degraded Audio Signals" by A. Chowdhury and A. Ross.

Primary language: Python. License: MIT.

Research Article

Anurag Chowdhury and Arun Ross, "Fusing MFCC and LPC Features using 1D Triplet CNN for Speaker Recognition in Severely Degraded Audio Signals," IEEE Transactions on Information Forensics and Security (2019).

1D-Triplet-CNN Model

1D-Triplet-CNN Details

Implementation details and requirements

The model was implemented in PyTorch 1.2.1 with Python 3.6. It may work with other versions of PyTorch and Python, but these have not been tested.

Additional requirements are listed in the ./requirements.txt file.

Usage

Source code and model parameters

The source code of the 1D-Triplet-CNN model can be found in the model subdirectory, and a pre-trained model is available in the trained_models subdirectory.

Dataset

The pre-trained model available in the trained_models subdirectory was trained on a subset of the Fisher speech corpus, obtained from https://catalog.ldc.upenn.edu/LDC2004S13. The training data was also degraded with varying levels of babble noise taken from the NOISEX-92 dataset.
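As an illustration of degrading speech with noise at varying levels, here is a minimal sketch that mixes a noise clip into an utterance at a chosen signal-to-noise ratio (SNR). The function name `mix_at_snr`, the 10 dB level, and the random stand-in signals are assumptions made for this example; the README does not state the SNR levels or the mixing procedure that were actually used.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so that the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not part of this repository.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat or crop the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Gain that scales the noise to the power implied by the target SNR.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


# Random signals stand in for real audio (assumption: 8000 Hz sampling).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for 2 s of speech
babble = rng.standard_normal(8000)   # stand-in for a babble-noise clip
noisy = mix_at_snr(speech, babble, snr_db=10.0)
```

Lower `snr_db` values give a more heavily degraded signal, so sweeping over several values produces training data with different amounts of degradation.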

Training the 1D-Triplet-CNN model

To train a 1D-Triplet-CNN model as described in the research paper, use the 1D-Triplet-CNN implementation provided in the models subdirectory. The network attains its best performance when trained within a triplet learning framework. See the research paper for details on training the model.
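For readers unfamiliar with triplet learning, the sketch below shows a single training step in PyTorch. It rests on several assumptions: `TinyEmbedder` is a hypothetical stand-in network and not the 1D-Triplet-CNN architecture, the input shape (batch, channels, frames) and the random tensors are placeholders for real MFCC-LPC features, and PyTorch's Euclidean `nn.TripletMarginLoss` is used for simplicity and may differ from the loss used in the paper.

```python
import torch
import torch.nn as nn


class TinyEmbedder(nn.Module):
    """Hypothetical stand-in network; NOT the 1D-Triplet-CNN architecture."""

    def __init__(self, in_channels=2, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
model = TinyEmbedder()
criterion = nn.TripletMarginLoss(margin=1.0)  # margin is an assumed value
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batches shaped (batch, channels, frames).
# anchor and positive would come from the same speaker, negative from another.
anchor = torch.randn(8, 2, 200)
positive = torch.randn(8, 2, 200)
negative = torch.randn(8, 2, 200)

# One optimization step: pull anchor/positive embeddings together and
# push anchor/negative embeddings apart by at least the margin.
optimizer.zero_grad()
loss = criterion(model(anchor), model(positive), model(negative))
loss.backward()
optimizer.step()
```

In practice the three batches would be drawn from a labeled dataset, and the step would be repeated over many triplets.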

Testing with the pretrained model

Recommended audio specifications

Usually, 2 seconds of speech audio sampled at 8 kHz (8000 Hz) is enough to produce reliable speaker recognition results. Longer audio samples make recognition significantly slower without a meaningful improvement in performance. Audio samples shorter than 1 second will suffer a considerable loss in performance.
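The duration guidance above can be applied before feature extraction. The following is a minimal sketch, assuming the audio is already loaded as a one-dimensional array sampled at 8000 Hz; the helper `prepare_clip` and its constants are hypothetical, and keeping only the first 2 seconds is a simplification (a real pipeline might prefer a speech-active region).

```python
import numpy as np

RECOMMENDED_SECONDS = 2.0  # duration recommended for reliable results
MINIMUM_SECONDS = 1.0      # shorter clips are expected to perform poorly


def prepare_clip(samples, sample_rate, target_seconds=RECOMMENDED_SECONDS):
    """Reject clips that are too short and crop long clips to the target length.

    Hypothetical helper for illustration; not part of this repository.
    """
    samples = np.asarray(samples)
    duration = len(samples) / sample_rate
    if duration < MINIMUM_SECONDS:
        raise ValueError(
            f"clip is {duration:.2f} s long; at least {MINIMUM_SECONDS} s is recommended"
        )
    max_len = int(round(target_seconds * sample_rate))
    return samples[:max_len]  # clips between 1 s and 2 s are returned unchanged


# A 5-second placeholder clip at 8000 Hz is cropped to 2 seconds.
clip = prepare_clip(np.zeros(5 * 8000), sample_rate=8000)
print(len(clip) / 8000)  # 2.0
```

Raising an error for short clips is a design choice made for this sketch; issuing a warning and continuing would also be reasonable.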

Usage

  1. Satisfy the requirements listed in the ./requirements.txt file.
  2. Run src/extractFeatures.m in MATLAB R2019a (or newer) to extract MFCC-LPC features from the audio files in the sample_audio subdirectory and save the features as individual .mat files in the sample_feature subdirectory.
  3. Run src/test.py in Python 3.6 to generate speaker verification scores for some sample audio pairs.
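To give a sense of what a speaker verification score for an audio pair can look like, the sketch below compares two embedding vectors with cosine similarity and applies a decision threshold. This is an assumption made for illustration: the README does not say how src/test.py computes its scores, and the function names, example vectors, and threshold value here are hypothetical.

```python
import numpy as np


def cosine_score(embedding_a, embedding_b):
    """Cosine similarity between two embeddings; higher means more similar."""
    a = np.asarray(embedding_a, dtype=np.float64).ravel()
    b = np.asarray(embedding_b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def same_speaker(score, threshold=0.5):
    """Accept the pair as the same speaker if the score reaches the threshold.

    The default threshold is an arbitrary placeholder; a real system would
    tune it on held-out data.
    """
    return score >= threshold


# Toy embeddings: the first two point in the same direction, the third is orthogonal.
emb_a = np.array([1.0, 0.0, 1.0])
emb_b = np.array([2.0, 0.0, 2.0])
emb_c = np.array([0.0, 1.0, 0.0])

score_match = cosine_score(emb_a, emb_b)
score_mismatch = cosine_score(emb_a, emb_c)
print(same_speaker(score_match), same_speaker(score_mismatch))
```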

Examples

Some usage examples may be added in the future.