This repository contains the SSTC2022 baseline system, including:
- Environment preparation
- Data preparation
- Model training
- Embedding extraction
- Performance calculation
Please visit https://sstc-challenge.github.io/ for more information about the challenge.
We recommend installing the dependencies in a conda environment:

```bash
conda create -y -n baseline python=3.8
conda activate baseline
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
pip install nemo_toolkit['all']==1.14.0
```
The system adopts online data augmentation for model training. Please prepare the MUSAN and RIR_NOISES datasets, and modify the paths in the './data/musan/' and './data/rir_noise/' files to point to your local copies.
For voice conversion, we utilize LibriSpeech as the source speaker dataset and VoxCeleb1&2 as the target speaker dataset. The converted speech datasets can be downloaded at https://sstc-challenge.github.io/.
Data preparation follows the Kaldi format: participants need "wav.scp", "utt2spk", and "spk2utt" files for a training dir, and "wav.scp" and "trials" files for an evaluation dir. "./data/Vox2dev" contains example training files and "./data/vox1-O" contains example evaluation files. Five data dirs need to be prepared for the baseline recipe (a minimal sketch for generating these files follows the listing):
```
# target speaker dataset
./data/Vox2dev/
    ./wav.scp
    ./utt2spk
    ./spk2utt
# source speaker dataset
./data/librispeech/train/
    ./wav.scp
    ./utt2spk
    ./spk2utt
# training dataset
./data/vcdata/
    ./train_1/
        ./wav.scp
        ./utt2spk
    ./train_2/
        ./wav.scp
        ./utt2spk
    ...
    ./train_8/
        ./wav.scp
        ./utt2spk
# development dataset
./data/vc-dev/
    ./dev_1.scp
    ./dev_2.scp
    ...
    ./dev_12.scp
    ./dev_trials # with keys
# evaluation dataset
./data/vc-eval/
    ./eval_1.scp
    ./eval_2.scp
    ...
    ./eval_k.scp
    ./eval_trials # without keys
```
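For reference, the sketch below (not part of the repo) builds "wav.scp", "utt2spk", and "spk2utt" from a corpus laid out as `<root>/<speaker>/<utt>.wav`; the layout and ID scheme are assumptions, so adapt them to your data.

```python
import os
from collections import defaultdict

root = "./corpus"  # hypothetical layout: <root>/<speaker>/<utt>.wav
spk2utt = defaultdict(list)

with open("wav.scp", "w") as wav_scp, open("utt2spk", "w") as utt2spk:
    for spk in sorted(os.listdir(root)):
        for wav in sorted(os.listdir(os.path.join(root, spk))):
            if not wav.endswith(".wav"):
                continue
            utt = f"{spk}-{os.path.splitext(wav)[0]}"  # prefix utt IDs with the speaker
            wav_scp.write(f"{utt} {os.path.join(root, spk, wav)}\n")
            utt2spk.write(f"{utt} {spk}\n")
            spk2utt[spk].append(utt)

# spk2utt is the inverse of utt2spk: one line per speaker listing its utterances
with open("spk2utt", "w") as f:
    for spk, utts in spk2utt.items():
        f.write(f"{spk} {' '.join(utts)}\n")
```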
We employ a pre-training strategy on the VoxCeleb2 development set so that the model first learns general speaker features, which yields better performance on the subsequent tasks.
The following are the results of the pre-trained model (half-small MFA_Conformer) on Vox1-O.
Vox1-O (EER) | Download Link
---|---
1.145% | Download
Running:

```bash
python train.py --save_dir 8vc \
    --data_name train_1 train_2 train_3 train_4 train_5 train_6 train_7 train_8 \
    --warmup_epochs 1 --dur_range 2 2 \
    --val_data_name vc-dev \
    --batch_size 512 --workers 40 \
    --mels 80 --fft 512 \
    --model ConformerMFA --embd_dim 256 \
    --classifier ArcFace --angular_m 0.2 --angular_s 32 --dropout 0 \
    --gpu 0,1,2,3 --epochs 25 --start_epoch 0 --lr 0.001 &
```
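The `--classifier ArcFace --angular_m 0.2 --angular_s 32` flags select an additive angular margin softmax head. Below is a generic PyTorch sketch of such a head, not the repo's exact implementation; the class count is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin softmax (m and s match the command above).

    The weight rows act as class centers on the unit hypersphere."""
    def __init__(self, embd_dim=256, n_classes=1000, m=0.2, s=32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, embd_dim))
        nn.init.xavier_uniform_(self.weight)
        self.m, self.s = m, s

    def forward(self, embd, labels):
        # cosine similarity between L2-normalized embeddings and class centers
        cos = F.linear(F.normalize(embd), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        # add the margin m to the target-class angle only, then scale by s
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(logits, labels)

# toy usage: 8 embeddings of dim 256, 1000 hypothetical classes
head = ArcFaceHead(embd_dim=256, n_classes=1000)
loss = head(torch.randn(8, 256), torch.randint(0, 1000, (8,)))
```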
There are three modes for scoring.py:

```python
# Extract speaker embeddings and compute the EER and mDCF
scoring = True
onlyscoring = False

# Extract speaker embeddings only
scoring = False
onlyscoring = False

# Compute EER and mDCF only (embeddings already extracted)
scoring = False/True
onlyscoring = True
```
Running:

```bash
# For example, dev-1
python scoring.py --save_dir 8vc \
    --val_data_name vc-dev --vc_method dev_1 --val_save_name dev_1 --model_num 24 \
    --onlyscore False --scoring True --trials dev_trials \
    --gpu 0 &
```
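For reference, EER and a normalized minimum DCF can be computed from trial scores and 0/1 labels as in the self-contained numpy sketch below; the operating point (p_target = 0.01, unit costs) is an assumption, so check the challenge's evaluation plan for the official settings.

```python
import numpy as np

def error_rates(scores, labels):
    """Miss (FNR) and false-alarm (FPR) rates swept over all thresholds."""
    order = np.argsort(scores)[::-1]      # descending scores
    labels = labels[order]
    tp = np.cumsum(labels)                # targets accepted so far
    fp = np.cumsum(1 - labels)            # non-targets accepted so far
    return 1.0 - tp / labels.sum(), fp / (1 - labels).sum()

def compute_eer(scores, labels):
    fnr, fpr = error_rates(scores, labels)
    i = np.argmin(np.abs(fnr - fpr))      # where the two rates cross
    return (fnr[i] + fpr[i]) / 2

def compute_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    fnr, fpr = error_rates(scores, labels)
    dcf = c_miss * p_target * fnr + c_fa * (1 - p_target) * fpr
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))

# toy usage with random scores and labels
scores, labels = np.random.randn(1000), np.random.randint(0, 2, 1000)
print(compute_eer(scores, labels), compute_min_dcf(scores, labels))
```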
The source speaker verification results (EER) on the development sets:

Dev-1 | Dev-2 | Dev-3 | Dev-4 | Dev-5 | Dev-6
---|---|---|---|---|---
9.397% | 8.619% | 7.671% | 7.594% | 7.507% | 12.885%

Dev-7 | Dev-8 | Dev-9 | Dev-10 | Dev-11 | Dev-12
---|---|---|---|---|---
32.484% | 28.795% | 34.045% | 45.772% | 17.209% | 20.808%
Download Link: Download.
We utilize adapters for multi-task (source speaker identification and conversion method identification) learning. Running (a generic adapter sketch follows the command):
```bash
python train_method.py --save_dir 8vc_method \
    --data_name train_1 train_2 train_3 train_4 train_5 train_6 train_7 train_8 \
    --loss_w 1.0 \
    --warmup_epochs 1 --dur_range 2 2 \
    --val_data_name vc-dev \
    --batch_size 512 --workers 40 \
    --mels 80 --fft 512 \
    --model ConformerMFA_MultiTask --embd_dim 256 \
    --classifier ArcFace --angular_m 0.2 --angular_s 32 --dropout 0 \
    --gpu 0 --epochs 25 --start_epoch 0 --lr 0.001 &
```
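Below is a generic bottleneck adapter in PyTorch; the dimensions and its placement inside ConformerMFA_MultiTask are assumptions, not the repo's exact code.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project.

    Inserted into a shared backbone, only these few extra parameters are
    trained for the additional task. Sizes below are illustrative."""
    def __init__(self, dim=256, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# usage: derive a method-identification representation from a shared embedding
adapter = BottleneckAdapter(dim=256)
method_embd = adapter(torch.randn(8, 256))
```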
```bash
# For example, dev-1
python scoring_method.py --save_dir 8vc_method \
    --val_data_name vc-dev --vc_method dev_1 --val_save_name dev_1 --model_num 24 \
    --onlyscore False --scoring True \
    --total_methods 8 --method_idx 0 \
    --gpu 0 &
```
Below is the performance of the model. The closed-set classification accuracy reaches 100% without compromising the source speaker verification performance.
Each cell reports EER / closed-set classification accuracy; "-" marks the methods unseen in training, for which closed-set accuracy is not applicable.

Dev-1 | Dev-2 | Dev-3 | Dev-4 | Dev-5 | Dev-6
---|---|---|---|---|---
9.575% / 100% | 8.421% / 100% | 7.629% / 100% | 7.700% / 100% | 7.403% / 100% | 12.841% / 100%

Dev-7 | Dev-8 | Dev-9 | Dev-10 | Dev-11 | Dev-12
---|---|---|---|---|---
32.113% / 100% | 27.911% / 100% | 32.595% / - | 44.527% / - | 17.409% / - | 20.828% / -
Extract all method embeddings from the training and development sets with the trained method recognition model. Running:
```bash
# For example, Train-1 and Dev-1
python scoring_method.py --save_dir 8vc_method \
    --val_data_name vcdata --vc_method train_1 --val_save_name train_1 --model_num 24 \
    --onlyscore False --scoring False \
    --total_methods 8 --method_idx 0 \
    --gpu 0 &
python scoring_method.py --save_dir 8vc_method \
    --val_data_name vc-dev --vc_method dev_1 --val_save_name dev_1 --model_num 24 \
    --onlyscore False --scoring False \
    --total_methods 8 --method_idx 0 \
    --gpu 0 &

# Train-1's method embeddings are saved in "train_1_cla_24.npy"
# Dev-1's method embeddings are saved in "dev_1_cla_24.npy"
```
The details can be found in the "osnn.ipynb" file; the steps are as follows (see the sketch after this list):
- Extract all method embeddings from the training set with the trained method recognition model, and randomly partition them into two subsets at a ratio of 1:9, denoted TS1 and TS9.
- For each method, average the TS9 embeddings to obtain the method's class center.
- Calculate the Euclidean distance from a test sample x to each class center, and compute the ratio R of the distances to its two nearest centers Ci and Cj, where Ci is the nearest center and Cj the second nearest.
- Given a threshold T, if the distance ratio R is less than T, assign the test sample to method i; otherwise, regard the sample as coming from an unseen method. (The threshold T is determined on the subset TS1.)
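A self-contained numpy sketch of this open-set nearest-neighbor rule; real runs would load the saved method embeddings (e.g. "train_1_cla_24.npy") instead of the synthetic data, and the threshold value here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# method embeddings from the extraction step, e.g. np.load("train_1_cla_24.npy");
# faked here as 8 methods x 1000 utterances x 256 dims
train = {m: rng.normal(m, 1.0, (1000, 256)) for m in range(8)}

# 1:9 split into TS1 (threshold tuning) and TS9 (class centers)
ts1, ts9 = {}, {}
for m, e in train.items():
    idx = rng.permutation(len(e))
    ts1[m], ts9[m] = e[idx[:len(e) // 10]], e[idx[len(e) // 10:]]

centers = np.stack([ts9[m].mean(axis=0) for m in range(8)])

def osnn_predict(x, centers, T):
    """Return the predicted method index, or -1 for an unseen method."""
    d = np.linalg.norm(centers - x, axis=1)  # distance to every class center
    i, j = np.argsort(d)[:2]                 # nearest and second-nearest centers
    return i if d[i] / d[j] < T else -1      # ratio test against the threshold

T = 0.9  # illustrative; tune on TS1 so that known-method samples pass the test
print(osnn_predict(ts1[3][0], centers, T))
```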
The final open-set classification accuracy is as follows:
Dev-1 | Dev-2 | Dev-3 | Dev-4 | Dev-5 | Dev-6
---|---|---|---|---|---
97.44% | 99.91% | 99.69% | 99.78% | 98.49% | 91.27%

Dev-7 | Dev-8 | Dev-9 | Dev-10 | Dev-11 | Dev-12
---|---|---|---|---|---
99.55% | 98.76% | 99.81% | 99.77% | 99.66% | 97.39%