/SpeakerVerifiaction-pytorch

Speaker Verification using Pytorch

Primary language: Jupyter Notebook. License: MIT.

Speaker Recognition Systems - Pytorch Implementation

This project started as a fork of qqueing/DeepSpeaker-pytorch.

1. Datasets

Prepare the data in the Kaldi way: make features in Process_Data and store shuffled, random-length features in egs. The other stages are handled in this repository.

  • Development:

Voxceleb1, Voxceleb2, Aishell1&2, CN-celeb1&2, aidatatang_200zh, MAGICDATA, TAL_CSASR, CHiME5, VOiCES, CommonVoice, AMI

  • Augmentation:

MUSAN, RIRS

  • Test:

SITW, Librispeech, TIMIT

1.1 Pre-Processing

  • Resample

  • Butterworth Bandpass Filtering

  • Augmentation

  • LMS Filtering ( Defected )

1.2 Acoustic Features

  • MFCC

  • Fbank

  • Spectrogram

2. Deep Speaker Verification Systems

2.1 Neural Networks

  • TDNN

TDNN_v2 is adapted from https://github.com/cvqluu/TDNN/blob/master/tdnn.py. The TDNN_v4 layer is implemented using nn.Conv2d, and the TDNN_v5 layer is implemented using nn.Conv1d.

  • ETDNN

  • FTDNN

  • DTDNN

  • Aggregated-Residual TDNN

  • ECAPA TDNN

  • ResCNN

  • LSTM

LSTM and Attention-based LSTM. The input is 40-dimensional MFCC.

  • ResNet

ResNet34

2.2 Loss Type

Classification
  • A-Softmax

  • AM-Softmax

  • AAM-Softmax

  • Center Loss

  • Ring Loss
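
The margin-based classification losses above differ mainly in how they modify the target-class logit before softmax. The following is a minimal sketch of that one step, not this repository's implementation: `margin_logit`, its parameter names, and the example margin and scale values are hypothetical. It assumes the input is the cosine between a length-normalized embedding and the length-normalized weight vector of the true class.

```python
import math


def margin_logit(cos_theta, margin, scale, kind):
    """Return the scaled target-class logit for a margin softmax variant.

    kind="am":  additive margin,         scale * (cos(theta) - margin)
    kind="aam": additive angular margin, scale * cos(theta + margin)
    Hypothetical helper for illustration only.
    """
    if kind == "am":
        return scale * (cos_theta - margin)
    if kind == "aam":
        # Clamp to the valid acos domain to guard against rounding error.
        theta = math.acos(max(-1.0, min(1.0, cos_theta)))
        return scale * math.cos(theta + margin)
    raise ValueError("kind must be 'am' or 'aam'")


# Both variants lower the target logit relative to plain scale * cos(theta),
# which forces the network to separate classes by a margin.
plain = 30.0 * 0.5
am = margin_logit(0.5, 0.2, 30.0, "am")
aam = margin_logit(0.5, 0.2, 30.0, "aam")
```

Practical AAM-Softmax implementations usually add a fallback for the case where theta + margin exceeds pi; that detail is omitted here.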

End-to-End
  • Generalized End-to-End Loss

  • Triplet Loss

  • Contrastive Loss

  • Prototypical Loss

  • Angular Prototypical Loss

2.3 Pooling Type

  • Self-Attention

  • Statistics Pooling

  • Attentive Statistics Pooling

  • GhostVLAD Pooling
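
As a rough illustration of the simplest entry in this list, statistics pooling collapses a variable-length sequence of frame-level features into one fixed-length vector by concatenating the per-dimension mean and standard deviation. The sketch below uses plain Python lists rather than tensors; `statistics_pooling` is a hypothetical name, and the use of the population (biased) standard deviation is an assumption.

```python
import math


def statistics_pooling(frames):
    """Pool T frame vectors of size D into one vector of size 2 * D.

    frames: list of T lists, each holding D floats.
    Returns the per-dimension means followed by the per-dimension
    population standard deviations.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


pooled = statistics_pooling([[1.0, 2.0], [3.0, 2.0]])
```

The attention-based variants replace the uniform average with learned per-frame weights; the output shape stays the same.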

3. Score

  • Cosine

  • PLDA

  • DET

  • t-SNE
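
Cosine scoring compares an enrollment embedding with a test embedding by the cosine of the angle between them; a higher score suggests the same speaker. A minimal sketch on plain Python lists follows, where `cosine_score` is a hypothetical helper name and not a function from this repository.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embeddings given as float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_score([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_score([1.0, 0.0], [0.0, 1.0])
```

A trial is then accepted when the score exceeds a threshold tuned on a development set.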

4. Diarization

  • Hierarchical Agglomerative Clustering
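
For orientation, agglomerative clustering for diarization starts with every segment embedding in its own cluster and repeatedly merges the closest pair until no pair is close enough. The sketch below assumes average linkage, cosine distance, and a fixed stopping distance; these choices and all function names are illustrative assumptions, not the settings used in this repository.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_clustering(embeddings, stop_distance):
    """Average-linkage clustering; returns lists of segment indices."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best_pair, best_dist = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(clusters[i], clusters[j])
                if best_dist is None or dist < best_dist:
                    best_pair, best_dist = (i, j), dist
        if best_dist > stop_distance:
            break  # the closest remaining pair is too far apart to merge
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
speakers = agglomerative_clustering(segments, stop_distance=0.5)
```

Each returned cluster corresponds to one hypothesized speaker. This brute-force version is cubic in the number of segments and is meant only to show the merge rule.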

5. Neural Network Analysis

  • Gradient

  • Grad-CAM

  • Grad-CAM++

  • Full-Grad

To do list

Work accomplished so far:

  • Models implementation
  • Data pipeline implementation - "Voxceleb"
  • Project structure cleanup.
  • Trained a simple ResNet10 with softmax + triplet loss for 10 pre-training batches, then with triplet loss for 18 epochs; resulting accuracy: ???
  • DET curve

Timeline

  • Extracted x-vectors from the trained neural network (20190626)
  • Code cleanup (factory model creation) (20200725)
  • Modified preprocessing
  • Modified the model for ResNet34, 50, 101 (20190625)
  • Added cosine distance to Triplet Loss; the previous distance was L2 (20190703)
  • Adding scoring for identification
  • Forked the PLDA method for classification in Python from: https://github.com/RaviSoji/plda/blob/master/plda/

5. Performance

5.1 Baseline

Group | Model | Epoch | Loss Type | Loss | Train/Test Accuracy (%) | EER (%)
1 | Resnet-10 | 1:22 | Triplet | 6.6420:0.0113 | 0.8553/0.8431 | ...
  | ResNet-34 | 1:8 | CrossEntropy | 8.0285:0.0301 | 0.8360/0.8302 | ...
2 | TDNN | 40 | CrossEntropy | 3.1716:0.0412 | vox1 dev/test 99.9994/99.5871 | 1.6700/5.4030
2 | TDNN | 40 | CrossEntropy | 3.0382:0.2196 | vox2 dev/vox1 test 98.5265/98.2733 | 3.0800/3.0859
  | ETDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ...
  | FTDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ...
  | DTDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ...
  | ARETDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ...
3 | LSTM | .... | ... | ... | ... | ...
  | LSTM | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ...
  | LSTM | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ...
2 | ... | .... | ... | ... | ... | ...
2 | TDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ...
  • TDNN_v5. Training set: voxceleb 2, 161-dimensional spectrogram. Loss: arcsoft. Scoring: cosine similarity.

    Test Set | EER (%) | Threshold | MinDCF-0.01 | MinDCF-0.01 | Date
    vox1 test | 2.3542 | 0.2698025 | 0.2192 | 0.2854 | 20210426
    sitw dev | 2.8109 | 0.2630014 | 0.2466 | 0.4026 | 20210515
    sitw eval | 3.2531 | 0.2642460 | 0.2984 | 0.4581 | 20210515
    cnceleb test | 16.8276 | 0.2165570 | 0.6923 | 0.8009 | 20210515
    aishell2 test | 10.8300 | 0.2786811 | 0.8212 | 0.9527 | 20210515
    aidata test | 10.0972 | 0.2952531 | 0.7859 | 0.9520 | 20210515
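
For readers unfamiliar with the EER and Threshold columns: the equal error rate is the operating point where the false acceptance rate on non-target trials equals the false rejection rate on target trials, and the threshold is the score at that point. The sketch below is one simple way to estimate both from trial scores. `compute_eer` is a hypothetical helper, not the scoring code of this repository, and it assumes a trial is accepted when its score is at or above the threshold.

```python
def compute_eer(target_scores, nontarget_scores):
    """Estimate (eer, threshold) by sweeping every observed score.

    target_scores: scores of same-speaker trials.
    nontarget_scores: scores of different-speaker trials.
    Picks the threshold where |FAR - FRR| is smallest and reports
    the mean of the two rates there.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: a target trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
```

With many trials, the sweep is usually done on sorted scores with interpolation between neighboring thresholds; this version trades speed for clarity.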

5.2 Baseline

6. Reference:

[1] Cai, Weicheng, Jinkun Chen, and Ming Li. "Analysis of Length Normalization in End-to-End Speaker Verification System." Conference of the International Speech Communication Association (Interspeech), 2018, pp. 3618-3622.

[2] ...