In Stage 1, the ability to discriminate between target and bona fide non-target speakers can be learned using the VoxCeleb2 database, which contains data collected from thousands of bona fide speakers. This repository provides pre-trained weights for the following models:
- ECAPA-TDNN: ecapa_tdnn.model (64MB)
- MFA_Conformer: mfa_conformer.model (89MB)
- SKA-TDNN: ska_tdnn.model (123MB)
| Model | Params | SASV-EER (%) | SV-EER (%) | SPF-EER (%) |
|---|---|---|---|---|
| ECAPA-TDNN | 16.7M | 20.66 | 0.74 | 27.30 |
| MFA-Conformer | 20.9M | 20.22 | 0.41 | 26.52 |
| SKA-TDNN | 29.4M | 16.74 | 0.38 | 22.38 |
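The three EER columns differ only in which trials count as negatives, following the SASV challenge convention: SV-EER scores bona fide target trials against bona fide non-target trials, SPF-EER against spoofed trials, and SASV-EER pools both non-target types. For reference, here is a minimal NumPy sketch of the EER computation (an illustration, not the repository's evaluation code):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the miss rate
    equals the false-alarm rate as the threshold is swept."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    # After rejecting the (i+1) lowest-scoring trials:
    fnr = np.cumsum(labels) / labels.sum()                   # targets wrongly rejected
    fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # non-targets still accepted
    idx = np.argmin(np.abs(fnr - fpr))
    return 100.0 * (fnr[idx] + fpr[idx]) / 2                 # EER in percent

# SV-EER:   nontarget_scores come from bona fide non-target trials.
# SPF-EER:  nontarget_scores come from spoofed trials.
# SASV-EER: nontarget_scores pool both non-target types.
```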
You can evaluate the pre-trained weights using the following commands:

```bash
cd stage3

python trainSASVNet.py \
    --eval \
    --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
    --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
    --model ECAPA_TDNN \
    --initial_model /path/to/weight/ecapa_tdnn.model

python trainSASVNet.py \
    --eval \
    --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
    --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
    --model MFA_Conformer \
    --initial_model /path/to/weight/mfa_conformer.model

python trainSASVNet.py \
    --eval \
    --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
    --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
    --model SKA_TDNN \
    --initial_model /path/to/weight/ska_tdnn.model
```
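If you want to sanity-check a downloaded checkpoint before running evaluation, here is a small sketch assuming the `.model` files are ordinary PyTorch checkpoints (the `state_dict` wrapping key below is a guess; adjust to what `torch.load` actually returns):

```python
import torch

# Load on CPU so no GPU is needed just to inspect the file.
ckpt = torch.load("/path/to/weight/ska_tdnn.model", map_location="cpu")

# Assumption: the file is either a bare state_dict or a dict wrapping one.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```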
In Stage 2, we augment the model with the ability to discriminate between bona fide and spoofed inputs by using large-scale data generated through an oracle speech synthesis system, referred to as copy synthesis.
This repository supports copy-synthesis training using copy-synthesized data from either the VoxCeleb2 dev set or the ASVspoof2019 LA train / train+dev partitions.
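Copy synthesis means analysing bona fide speech into acoustic features and re-synthesising a waveform from them, so the output preserves the original content while carrying synthesis artefacts. As a rough, self-contained illustration (not the repository's pipeline, which uses an oracle speech synthesis system), here is an analysis/re-synthesis round trip with a mel-spectrogram and Griffin-Lim via librosa; the file paths are illustrative:

```python
import librosa
import soundfile as sf

# Load a bona fide utterance (path is illustrative).
wav, sr = librosa.load("bonafide.wav", sr=16000)

# Analysis: extract a mel-spectrogram from the waveform.
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Re-synthesis: invert the mel-spectrogram with Griffin-Lim phase estimation.
# The result sounds like the input but carries vocoder-style artefacts,
# which is the signal that spoof-discrimination training exploits.
cs_wav = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                              hop_length=256)

sf.write("copy_synthesized.wav", cs_wav, sr)
```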
Even though training in Stages 1 and 2 learns to discriminate bona fide non-target and spoofed non-target inputs, a domain mismatch with the evaluation protocol remains, and artefacts from the acoustic model have yet to be learned. Hence, in Stage 3, we fine-tune the model using in-domain bona fide and spoofed data contained within the ASVspoof2019 LA dataset.
The four rightmost columns report SASV-EER (%); *train* and *train+dev* denote the ASVspoof2019 LA partitions used during training (bna: bona fide, spf: spoofed, cs: copy-synthesized).

| # | Stage 1: ASV-based Pre-training | Stage 2: Copy-synthesis Training | Stage 3: In-domain Fine-tuning | SKA-TDNN train | SKA-TDNN train+dev | MFA-Conformer train | MFA-Conformer train+dev |
|---|---|---|---|---|---|---|---|
| 1 | - | - | ASVspoof2019 (bna+spf) | 9.55 | 5.94 | 11.47 | 7.67 |
| 2 | VoxCeleb2 (bna) | - | - | - | 16.74 | - | 20.22 |
| 3 | VoxCeleb2 (bna) | - | ASVspoof2019 (bna+spf) | 2.67 | 1.25 | 2.13 | 1.51 |
| 4 | - | VoxCeleb2 (bna+cs) | - | - | 13.11 | - | 14.27 |
| 5 | - | VoxCeleb2 (bna+cs) | ASVspoof2019 (bna+spf) | 2.47 | 1.93 | 1.91 | 1.35 |
| 6 | VoxCeleb2 (bna) | VoxCeleb2 (bna+cs) | - | - | 10.24 | - | 12.33 |
| 7 | VoxCeleb2 (bna) | VoxCeleb2 (bna+cs) | ASVspoof2019 (bna+spf) | 1.83 | 1.56 | 1.19 | 1.06 |
| 8 | - | ASVspoof2019 (bna+cs) | - | 13.10 | 10.49 | 13.63 | 12.48 |
| 9 | - | ASVspoof2019 (bna+cs) | ASVspoof2019 (bna+spf) | 9.57 | 6.17 | 13.46 | 10.11 |
| 10 | VoxCeleb2 (bna) | ASVspoof2019 (bna+cs) | - | 5.62 | 4.93 | 9.31 | 8.32 |
| 11 | VoxCeleb2 (bna) | ASVspoof2019 (bna+cs) | ASVspoof2019 (bna+spf) | 2.48 | 1.44 | 2.72 | 1.76 |
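The trial scores behind these EERs are typically cosine similarities between enrolment and test embeddings, which is standard ASV scoring. A minimal sketch of that scoring step, assuming a model whose forward pass returns an utterance embedding (the names are illustrative, not the repository's API):

```python
import torch
import torch.nn.functional as F

def score_trial(model, enrol_wav, test_wav):
    """Cosine similarity between L2-normalised embeddings."""
    model.eval()
    with torch.no_grad():
        e = F.normalize(model(enrol_wav), dim=-1)  # enrolment embedding, [1, D]
        t = F.normalize(model(test_wav), dim=-1)   # test embedding, [1, D]
    return (e * t).sum(-1).item()                  # higher = more likely target
```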
Each pre-trained weight can be downloaded from the links above.
If you utilize this repository, please cite the following papers:
```bibtex
@inproceedings{chung2020in,
  title={In defence of metric learning for speaker recognition},
  author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle={Proc. Interspeech},
  year={2020}
}

@inproceedings{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  booktitle={Proc. Interspeech},
  year={2022}
}

@inproceedings{mun2022frequency,
  title={Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification},
  author={Mun, Sung Hwan and Jung, Jee-weon and Han, Min Hyun and Kim, Nam Soo},
  booktitle={Proc. IEEE SLT},
  year={2022}
}
```