This project was originally forked from qqueing/DeepSpeaker-pytorch.
Data is prepared in the Kaldi way: features are extracted in Process_Data, and shuffled features with random lengths are stored in egs (a minimal sketch of this step follows). The remaining stages are handled in this repository.
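A minimal sketch of the egs idea, assuming the third-party kaldi_io reader and illustrative chunk-length bounds; the actual Process_Data scripts may organize this differently:

```python
# Sketch only: read Kaldi features and store shuffled, random-length chunks.
# kaldi_io, the chunk-length bounds, and the output format are assumptions.
import random
import kaldi_io   # third-party: pip install kaldi_io
import torch

def make_egs(feats_scp, out_path, min_len=200, max_len=400, seed=123):
    rng = random.Random(seed)
    chunks = []
    for utt, mat in kaldi_io.read_mat_scp(feats_scp):   # mat: (frames, feat_dim)
        length = rng.randint(min_len, max_len)          # random chunk length
        if mat.shape[0] <= length:
            continue
        start = rng.randint(0, mat.shape[0] - length)
        chunks.append((utt, mat[start:start + length].copy()))
    rng.shuffle(chunks)                                 # shuffle across utterances
    torch.save(chunks, out_path)

# make_egs("data/voxceleb1_train/feats.scp", "egs/train_egs.pt")
```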
- Development: Voxceleb1, Voxceleb2, Aishell1&2, CN-Celeb1&2, aidatatang_200zh, MAGICDATA, TAL_CSASR, CHiME-5, VOiCES, CommonVoice, AMI
- Augmentation: MUSAN, RIRS
- Test: SITW, Librispeech, TIMIT
- Resample
- Butter Bandpass Filtering (a filtering/Fbank sketch follows this list)
- Augmentation
- LMS Filtering (defective)
- MFCC
- Fbank
- Spectrogram
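Below is a hedged sketch of two of the steps above: the Butter bandpass filter (via scipy) and Kaldi-style Fbank extraction (via torchaudio.compliance.kaldi). The 300-3400 Hz band, filter order, and 40 mel bins are illustrative assumptions, not necessarily the settings used in Process_Data.

```python
import numpy as np
import torch
import torchaudio
from scipy.signal import butter, sosfiltfilt

def butter_bandpass(wav, sr, low_hz=300.0, high_hz=3400.0, order=4):
    # Zero-phase Butterworth bandpass; band edges are placeholder values.
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, wav)

def kaldi_fbank(wav, sr, num_mel_bins=40):
    # Kaldi-compatible log mel filterbank features: (frames, num_mel_bins).
    wav_t = torch.from_numpy(np.ascontiguousarray(wav)).float().unsqueeze(0)
    return torchaudio.compliance.kaldi.fbank(
        wav_t, num_mel_bins=num_mel_bins, sample_frequency=sr)

# wav = np.random.randn(16000)                     # 1 s of dummy audio at 16 kHz
# feats = kaldi_fbank(butter_bandpass(wav, 16000), 16000)
```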
- TDNN
  TDNN_v2 is based on https://github.com/cvqluu/TDNN/blob/master/tdnn.py. The TDNN_v4 layer is implemented with nn.Conv2d and the TDNN_v5 layer with nn.Conv1d (a minimal Conv1d sketch follows this list).
- ETDNN
- FTDNN
- DTDNN
- Aggregated-Residual TDNN
- ECAPA-TDNN
- ResCNN
- LSTM
  LSTM and attention-based LSTM, with 40-dimensional MFCC input.
- ResNet
  ResNet34
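As a companion to the TDNN_v5 note above, here is a minimal sketch of a single TDNN layer built on nn.Conv1d; the context size, dilation, and channel widths are illustrative, not the repository's actual configuration.

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, context_size=5, dilation=1):
        super().__init__()
        # A frame context of {-2,-1,0,1,2} maps to kernel_size=5, dilation=1.
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context_size,
                              dilation=dilation)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU()

    def forward(self, x):              # x: (batch, feat_dim, frames)
        return self.bn(self.relu(self.conv(x)))

# x = torch.randn(8, 40, 300)          # 8 utterances, 40-dim features, 300 frames
# y = TDNNLayer(40, 512)(x)            # -> (8, 512, 296)
```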
- A-Softmax
- AM-Softmax (a minimal sketch follows this list)
- AAM-Softmax
- Center Loss
- Ring Loss
- Generalized End-to-End Loss
- Triplet Loss
- Contrastive Loss
- Prototypical Loss
- Angular Prototypical Loss
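For the AM-Softmax entry above, a minimal PyTorch sketch of the additive-margin classification head; the scale s and margin m are common defaults from the literature and may differ from this project's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmax(nn.Module):
    def __init__(self, embed_dim, num_classes, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, x, labels):
        # Cosine between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(x), F.normalize(self.weight))
        margin = torch.zeros_like(cosine)
        margin.scatter_(1, labels.unsqueeze(1), self.m)   # subtract m on target class
        logits = self.s * (cosine - margin)
        return F.cross_entropy(logits, labels)

# loss = AMSoftmax(512, 1211)(torch.randn(8, 512), torch.randint(0, 1211, (8,)))
```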
- Self-Attention
- Statistics Pooling (a minimal sketch follows this list)
- Attentive Statistics Pooling
- GhostVLAD Pooling
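A minimal sketch of plain statistics pooling (per-channel mean and standard deviation over frames); attentive statistics pooling weights the frames with learned attention before taking the same moments.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    def forward(self, x):                      # x: (batch, channels, frames)
        mean = x.mean(dim=2)
        std = x.std(dim=2)
        return torch.cat([mean, std], dim=1)   # (batch, 2 * channels)

# pooled = StatsPooling()(torch.randn(8, 512, 296))   # -> (8, 1024)
```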
- Cosine (a scoring sketch follows this list)
- PLDA
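A minimal sketch of cosine scoring between enrollment and test embeddings; the decision threshold is tuned on a development set (compare the thresholds reported in the results tables below).

```python
import torch
import torch.nn.functional as F

def cosine_score(enroll, test):
    # Both: (num_trials, embed_dim); returns one score per trial in [-1, 1].
    return F.cosine_similarity(enroll, test, dim=1)

# scores = cosine_score(torch.randn(10, 512), torch.randn(10, 512))
# decisions = scores > 0.27            # threshold chosen on held-out data
```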
- DET (an EER sketch follows this list)
- t-SNE
- Hierarchical Agglomerative Clustering
- Gradient
- Grad-CAM
- Grad-CAM++
- Full-Grad
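A hedged sketch of reading the EER (the equal-error operating point on a DET curve) from verification scores, using scikit-learn's roc_curve; the project's own DET code may compute it differently.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    # labels: 1 for target trials, 0 for non-target trials.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where miss rate ~= false-alarm rate
    return (fnr[idx] + fpr[idx]) / 2.0

# eer = compute_eer(np.random.rand(1000), np.random.randint(0, 2, 1000))
```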
Work accomplished so far:
- Implemented the models.
- Implemented the data pipeline for VoxCeleb.
- Cleaned up the project structure.
- Trained a simple ResNet10 with softmax pre-training for 10 batches and triplet loss for 18 epochs, resulting in accuracy ???.
- Plotted DET curves.
- Extracted x-vectors from the trained network (20190626).
- Cleaned up the code with factory-style model creation (20200725).
- Modified the preprocessing.
- Modified the model for ResNet34/50/101 (20190625).
- Added cosine distance to the triplet loss (previously L2) (20190703).
- Added scoring for identification.
- Forked the Python PLDA classification code from https://github.com/RaviSoji/plda/blob/master/plda/.
| Group | Model | Epoch | Loss Type | Loss | Train/Test | Accuracy (%) | EER (%) |
|---|---|---|---|---|---|---|---|
| 1 | ResNet-10 | 1:22 | Triplet | 6.6420:0.0113 | 0.8553/0.8431 | ... | |
| | ResNet-34 | 1:8 | CrossEntropy | 8.0285:0.0301 | 0.8360/0.8302 | ... | |
| 2 | TDNN | 40 | CrossEntropy | 3.1716:0.0412 | vox1 dev/test | 99.9994/99.5871 | 1.6700/5.4030 |
| 2 | TDNN | 40 | CrossEntropy | 3.0382:0.2196 | vox2 dev/vox1 test | 98.5265/98.2733 | 3.0800/3.0859 |
| | ETDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ... | |
| | FTDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ... | |
| | DTDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ... | |
| | ARETDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ... | |
| 3 | LSTM | .... | ... | ... | ... | ... | |
| | LSTM | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ... | |
| | LSTM | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ... | |
| 2 | ... | .... | ... | ... | ... | ... | |
| 2 | TDNN | .... | Softmax | 8.0285:0.0301 | 0.8360/0.8302 | ... | |
- TDNN_v5, training set: VoxCeleb2 161-dimensional spectrogram, loss: arcsoft (AAM-Softmax), scoring: cosine similarity

| Test Set | EER (%) | Threshold | MinDCF-0.01 | MinDCF-0.001 | Date |
|---|---|---|---|---|---|
| vox1 test | 2.3542 | 0.2698025 | 0.2192 | 0.2854 | 20210426 |
| sitw dev | 2.8109 | 0.2630014 | 0.2466 | 0.4026 | 20210515 |
| sitw eval | 3.2531 | 0.2642460 | 0.2984 | 0.4581 | 20210515 |
| cnceleb test | 16.8276 | 0.2165570 | 0.6923 | 0.8009 | 20210515 |
| aishell2 test | 10.8300 | 0.2786811 | 0.8212 | 0.9527 | 20210515 |
| aidata test | 10.0972 | 0.2952531 | 0.7859 | 0.9520 | 20210515 |