SASV2_Baseline

SASV2 baseline for the SASV track of the ASVspoof 5 Phase 2 challenge


Towards single integrated spoofing-aware speaker verification embeddings

Getting started. Installing dependencies.

A requirements.txt is included in each stage directory (all of them are identical, so you can use any one).

If you use an Anaconda virtual environment, create and activate it:

conda create -n sasv python=3.9 cudatoolkit=11.3
conda activate sasv

Then install all dependency packages:

pip3 install -r requirements.txt
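
As a quick sanity check after installation, you can confirm that PyTorch sees the CUDA toolkit (a minimal snippet, assuming PyTorch is among the pinned requirements):

import torch

# Verify the installed build and that a CUDA device is visible.
print(torch.__version__)          # should report a build compatible with cudatoolkit 11.3
print(torch.cuda.is_available())  # True on a correctly configured GPU machine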

Stage 1. Speaker classification-based Pre-training.

In Stage 1, the model learns to discriminate between target and bona fide non-target speakers using the VoxCeleb2 database, which contains data collected from thousands of bona fide speakers. This repository provides pre-trained weights for the following models:

Model          Params  SASV-EER (%)  SV-EER (%)  SPF-EER (%)
ECAPA-TDNN     16.7M   20.66         0.74        27.30
MFA-Conformer  20.9M   20.22         0.41        26.52
SKA-TDNN       29.4M   16.74         0.38        22.38
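
The three metrics are all equal error rates over different trial subsets: SV-EER uses bona fide non-target trials, SPF-EER uses spoofed trials, and SASV-EER pools both as non-target. As a reference, here is a minimal sketch of computing an EER from raw trial scores (plain numpy; the array names are illustrative, not the repository's evaluation code):

import numpy as np

def compute_eer(scores, labels):
    """EER from similarity scores; labels: 1 = target trial, 0 = non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)[np.argsort(scores)]  # sort by ascending score
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    # Sweep the decision threshold over the sorted scores.
    fr = np.cumsum(labels) / n_tar              # false rejection rate at each threshold
    fa = 1.0 - np.cumsum(1 - labels) / n_non    # false acceptance rate at each threshold
    idx = np.argmin(np.abs(fr - fa))            # point where the two rates cross
    return (fr[idx] + fa[idx]) / 2 * 100        # EER in percent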

You can evaluate the pre-trained weights using the following commands:

cd stage3

python trainSASVNet.py \
        --eval \
        --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
        --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
        --model ECAPA_TDNN \
        --initial_model /path/to/weight/ecapa_tdnn.model

python trainSASVNet.py \
        --eval \
        --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
        --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
        --model MFA_Conformer \
        --initial_model /path/to/weight/mfa_conformer.model

python trainSASVNet.py \
        --eval \
        --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
        --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
        --model SKA_TDNN \
        --initial_model /path/to/weight/ska_tdnn.model
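
Internally, each trial in the protocol file is typically scored as the cosine similarity between the embeddings of the enrolment and test utterances. A minimal sketch of that scoring step (the model object and its call signature are hypothetical stand-ins, not the script's exact interface):

import torch
import torch.nn.functional as F

def score_trial(model, enrol_wav, test_wav):
    """Cosine similarity between two utterance embeddings (hypothetical interface)."""
    model.eval()
    with torch.no_grad():
        enrol_emb = model(enrol_wav)   # (1, D) embedding of the enrolment utterance
        test_emb = model(test_wav)     # (1, D) embedding of the test utterance
    # Higher scores mean the trial looks like a bona fide target speaker.
    return F.cosine_similarity(enrol_emb, test_emb).item()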

Stage 2. Copy-synthesis Training.

In Stage 2, we add the ability to discriminate between bona fide and spoofed inputs by training on large-scale data generated with an oracle speech synthesis system, referred to as copy synthesis. This repository supports copy-synthesis training with copy-synthesized data from the VoxCeleb2 dev set or from the ASVspoof2019 LA train or train+dev sets.
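
Copy synthesis re-generates a bona fide waveform from its own acoustic features, so the output keeps the speaker and content but carries synthesis artefacts, yielding spoof-like training data without a full TTS or VC pipeline. A rough illustration of the analysis-synthesis loop, using Griffin-Lim as a stand-in for the vocoders actually used (librosa and soundfile; the file names are placeholders):

import librosa
import soundfile as sf

# Analysis: extract a mel spectrogram from a bona fide utterance.
wav, sr = librosa.load("bonafide.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=512, hop_length=128)

# Synthesis: invert the features back to a waveform. Griffin-Lim is used here
# only for illustration; real copy synthesis uses proper vocoders, which leave
# different artefacts for the model to learn.
resynth = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=512, hop_length=128)
sf.write("copy_synthesized.wav", resynth, sr)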

Stage 3. In-domain Fine-tuning.

Although training in Stages 1 and 2 teaches the model to discriminate bona fide non-target and spoofed non-target inputs, a domain mismatch with the evaluation protocol remains. Furthermore, artefacts from the acoustic model have yet to be learned. Hence, in Stage 3, we fine-tune the model using the in-domain bona fide and spoofed data contained in the ASVspoof2019 LA dataset.
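
In code terms, Stage 3 is simply continued training from the Stage 1/2 checkpoint at a reduced learning rate on in-domain data. A schematic sketch with toy placeholders (the model, loss, and data below are dummies, not the repository's SASVNet or its loaders):

import torch
import torch.nn as nn

# Dummy stand-ins: a real run uses the repository's SASVNet model and the
# ASVspoof2019 LA data loader; everything here is a placeholder.
model = nn.Sequential(nn.Linear(80, 192), nn.ReLU(), nn.Linear(192, 2))
criterion = nn.CrossEntropyLoss()

# Start from the previous stage's weights (path is a placeholder), then
# fine-tune with a learning rate smaller than in pre-training.
# model.load_state_dict(torch.load("/path/to/weight/stage2.model"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

model.train()
for step in range(3):                      # toy loop standing in for epochs over LA data
    feats = torch.randn(8, 80)             # placeholder acoustic features
    labels = torch.randint(0, 2, (8,))     # bona fide target vs non-target/spoof
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    optimizer.step()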

Summary. Experimental results and pre-trained weights for several models.

No.  Stage 1                 Stage 2                 Stage 3                  SASV-EER (%)
     (ASV-based              (Copy-synthesis         (In-domain               SKA-TDNN          MFA-Conformer
     Pre-training)           Training)               Fine-tuning)             train  train+dev  train  train+dev
1    -                       -                       ASVspoof2019 (bna+spf)   9.55   5.94       11.47  7.67
2    VoxCeleb2 (bna)         -                       -                        -      16.74      -      20.22
3    VoxCeleb2 (bna)         -                       ASVspoof2019 (bna+spf)   2.67   1.25       2.13   1.51
4    -                       VoxCeleb2 (bna+cs)      -                        -      13.11      -      14.27
5    -                       VoxCeleb2 (bna+cs)      ASVspoof2019 (bna+spf)   2.47   1.93       1.91   1.35
6    VoxCeleb2 (bna)         VoxCeleb2 (bna+cs)      -                        -      10.24      -      12.33
7    VoxCeleb2 (bna)         VoxCeleb2 (bna+cs)      ASVspoof2019 (bna+spf)   1.83   1.56       1.19   1.06
8    -                       ASVspoof2019 (bna+cs)   -                        13.10  10.49      13.63  12.48
9    -                       ASVspoof2019 (bna+cs)   ASVspoof2019 (bna+spf)   9.57   6.17       13.46  10.11
10   VoxCeleb2 (bna)         ASVspoof2019 (bna+cs)   -                        5.62   4.93       9.31   8.32
11   VoxCeleb2 (bna)         ASVspoof2019 (bna+cs)   ASVspoof2019 (bna+spf)   2.48   1.44       2.72   1.76

(bna = bona fide, spf = spoofed, cs = copy-synthesized; train / train+dev denotes the ASVspoof2019 LA partition used for training.)

You can download each pre-trained weight from the links above.

Citation

If you use this repository, please cite the following papers:

@inproceedings{chung2020in,
  title={In defence of metric learning for speaker recognition},
  author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle={Proc. Interspeech},
  year={2020}
}
@inproceedings{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  booktitle={Proc. Interspeech},
  year={2022}
}
@inproceedings{mun2022frequency,
  title={Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification},
  author={Mun, Sung Hwan and Jung, Jee-weon and Han, Min Hyun and Kim, Nam Soo},
  booktitle={Proc. IEEE SLT},
  year={2022}
}