Roadmap | Paper | Runtime (x86_gpu) | Python binding | Pretrained Models | Huggingface Demo
WeSpeaker mainly focuses on speaker embedding learning, with application to the speaker verification task. We support both online feature extraction and loading pre-extracted features in Kaldi format.
- Clone this repo
git clone https://github.com/wenet-e2e/wespeaker.git
- Create a conda env (PyTorch version >= 1.10.0 is required):
conda create -n wespeaker python=3.9
conda activate wespeaker
conda install pytorch=1.12.1 torchaudio=0.12.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
- If you just want to use the pretrained models, try the Python binding!
pip3 install wespeakerruntime
- 2023.02.27: Update onnxruntime (C++), see onnxruntime, #135.
- 2023.02.15: Update the code for multi-node training. For how to set up multi-node training, please refer to #131.
- 2022.11.30: Multi-Query Multi-Head Attentive Pooling (MQMHASTP) and Intertopk-Subcenter Loss are supported, see #115.
- VoxCeleb: Speaker Verification recipe on the VoxCeleb dataset
- 🔥 UPDATE 2022.10.31: We support deep r-vector up to the 293-layer version, achieving 0.447%/0.043 EER/minDCF on the vox1-O-clean test set
- 🔥 UPDATE 2022.07.19: We apply the same setups as the CNCeleb recipe and obtain SOTA performance among open-source systems
- EER/minDCF on vox1-O-clean test set are 0.723%/0.069 (ResNet34) and 0.728%/0.099 (ECAPA_TDNN_GLOB_c1024), after LM fine-tuning and AS-Norm
- CNCeleb: Speaker Verification recipe on the CNCeleb dataset
- VoxConverse: Diarization recipe on the VoxConverse dataset
- Model (SOTA Models)
- Pooling Functions
- TAP(mean) / TSDP(std) / TSTP(mean+std)
- Comparison of mean/std pooling can be found in shuai_iscslp, anna_arxiv
- Attentive Statistics Pooling (ASTP)
- Mainly for ECAPA_TDNN
- Multi-Query and Multi-Head Attentive Statistics Pooling (MQMHASTP)
- Details can be found in MQMHASTP
- Criteria
- Scoring
- Cosine
- PLDA
- Score Normalization (AS-Norm)
- Metric
- EER
- minDCF
- Online Augmentation
- Noise && RIR
- Speed Perturb
- SpecAug
- Training Strategy
- Well-designed Learning Rate and Margin Schedulers
- Large Margin Fine-tuning
- Automatic Mixed Precision (AMP) Training
- Literature
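To make a few of the items above concrete, here is a minimal NumPy sketch of TSTP (mean+std) pooling, cosine scoring, and EER computation. It is an illustration only and not WeSpeaker's actual implementation: the function names `tstp`, `cosine_score`, and `compute_eer` are hypothetical, the standard deviation is the population one (a real model may use an unbiased estimate plus a small epsilon), and the EER is found by a simple threshold sweep rather than an interpolated ROC curve.

```python
import numpy as np


def tstp(frames):
    """Temporal statistics pooling (assumed form): concatenate the
    per-dimension mean and std over time. frames: (T, D) -> (2 * D,)."""
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)  # population std; an assumption of this sketch
    return np.concatenate([mean, std])


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


def compute_eer(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(frr - far)))
    return float((frr[idx] + far[idx]) / 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two fake utterances with 200 frames of 8-dim frame-level features each.
    emb1 = tstp(rng.normal(size=(200, 8)))
    emb2 = tstp(rng.normal(size=(200, 8)))
    print("embedding dim:", emb1.shape[0])
    print("cosine score:", cosine_score(emb1, emb2))
    print("EER:", compute_eer([0.9, 0.4], [0.6, 0.1]))
```

In a real pipeline the frame-level features would come from the network's last frame-level layer, and the trial scores would typically go through score normalization (e.g. AS-Norm) before EER and minDCF are computed.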
For Chinese users, you can scan the QR code on the left to follow the official account of the WeNet Community.
We also created a WeChat group for better discussion and quicker response. Please scan the QR code on the right to join the chat group.
If you find wespeaker useful, please cite it as
@article{wang2022wespeaker,
title={Wespeaker: A Research and Production oriented Speaker Embedding Learning Toolkit},
author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
journal={arXiv preprint arXiv:2210.17016},
year={2022}
}
If you are interested in contributing, feel free to contact @wsstriving or @robin1001.