SASV2_Baseline

SASV2 baseline, a track of the ASVspoof5 phase 2 challenge


Towards single integrated spoofing-aware speaker verification embeddings

Stage 1. Speaker classification-based Pre-training.

In Stage 1, the ability to discriminate between target and bona fide non-target speakers is learned using the VoxCeleb2 database, which contains data from thousands of bona fide speakers. In this repository, we provide pre-trained weights for the following models:

Model           Params   SASV-EER (%)   SV-EER (%)   SPF-EER (%)
ECAPA-TDNN      16.7M    20.66          0.74         27.30
MFA-Conformer   20.9M    20.22          0.41         26.52
SKA-TDNN        29.4M    16.74          0.38         22.38
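
As a rough sketch of what the Stage 1 objective looks like, the snippet below trains an embedding extractor with a plain softmax speaker-classification head and cross-entropy loss. The SpeakerClassifier wrapper, its dimensions, and train_step are illustrative assumptions; the repository's actual models and loss functions are defined in trainSASVNet.py and the code it imports.

import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    """Embedding extractor with a speaker-classification head (Stage 1 sketch)."""
    def __init__(self, encoder: nn.Module, emb_dim: int, num_speakers: int):
        super().__init__()
        self.encoder = encoder                        # e.g. ECAPA-TDNN / MFA-Conformer / SKA-TDNN
        self.head = nn.Linear(emb_dim, num_speakers)  # speaker-classification head

    def forward(self, feats):
        emb = self.encoder(feats)                     # (batch, emb_dim) utterance embedding
        return self.head(emb), emb

def train_step(model, optimizer, feats, speaker_ids):
    """One hypothetical pre-training step on VoxCeleb2 speaker labels."""
    logits, _ = model(feats)
    loss = nn.functional.cross_entropy(logits, speaker_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()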

You can evaluate the pre-trained weights using the following commands:

cd stage3

python trainSASVNet.py \
        --eval \
        --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
        --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
        --model ECAPA_TDNN \
        --initial_model /path/to/weight/ecapa_tdnn.model

python trainSASVNet.py \
        --eval \
        --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
        --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
        --model MFA_Conformer \
        --initial_model /path/to/weight/mfa_conformer.model

python trainSASVNet.py \
        --eval \
        --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
        --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
        --model SKA_TDNN \
        --initial_model /path/to/weight/ska_tdnn.model
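
For reference, the protocol reports three EERs: SV-EER (target vs. bona fide non-target trials), SPF-EER (target vs. spoofed trials), and SASV-EER (target vs. both impostor types pooled). The sketch below shows one way to compute all three from a score file; the scores.txt name and its two-column "score label" layout are assumptions rather than the repository's actual output format.

import numpy as np
from sklearn.metrics import roc_curve

def eer(positive, negative):
    """Equal error rate (%) between two score populations."""
    scores = np.concatenate([positive, negative])
    labels = np.concatenate([np.ones(len(positive)), np.zeros(len(negative))])
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))       # operating point where FAR ~= FRR
    return 100.0 * (fpr[i] + fnr[i]) / 2.0

# Hypothetical score file: one "score label" pair per line,
# where label is one of {target, nontarget, spoof}.
scores = {"target": [], "nontarget": [], "spoof": []}
with open("scores.txt") as f:
    for line in f:
        s, lab = line.split()
        scores[lab].append(float(s))

tgt = np.array(scores["target"])
non = np.array(scores["nontarget"])
spf = np.array(scores["spoof"])

print(f"SV-EER   (%): {eer(tgt, non):.2f}")                          # target vs bona fide non-target
print(f"SPF-EER  (%): {eer(tgt, spf):.2f}")                          # target vs spoof
print(f"SASV-EER (%): {eer(tgt, np.concatenate([non, spf])):.2f}")   # target vs all impostors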

Stage 2. Copy-synthesis Training.

In Stage 2, we equip the model with the ability to discriminate between bona fide and spoofed inputs, using large-scale data generated with an oracle speech synthesis system, a process referred to as copy synthesis. This repository supports copy-synthesis training with copy-synthesized data derived from the VoxCeleb2 dev set or from the ASVspoof2019 LA train / train+dev partitions.
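
Copy synthesis passes bona fide speech through an analysis-resynthesis pipeline, so content and speaker identity are preserved while vocoder artefacts are introduced. The toy sketch below uses torchaudio's Griffin-Lim as a stand-in vocoder purely for illustration; the file names are placeholders, and the copy-synthesized corpora used here are generated with dedicated synthesis systems, not this pipeline.

import torchaudio
import torchaudio.transforms as T

wav, sr = torchaudio.load("bonafide.wav")          # placeholder input path

n_fft, hop, n_mels = 1024, 256, 80
to_mel = T.MelSpectrogram(sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
to_lin = T.InverseMelScale(n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sr)
vocoder = T.GriffinLim(n_fft=n_fft, hop_length=hop)

mel = to_mel(wav)                                  # analysis: waveform -> mel-spectrogram
spoofed = vocoder(to_lin(mel))                     # resynthesis: mel -> waveform with artefacts
torchaudio.save("copy_synthesized.wav", spoofed, sr)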

Stage 3. In-domain Fine-tuning.

Even though Stages 1 and 2 teach the model to discriminate between bona fide non-target and spoofed non-target inputs, a domain mismatch with the evaluation protocol remains. Furthermore, artefacts from the acoustic model have yet to be learned. Hence, in Stage 3, we fine-tune the model using the in-domain bona fide and spoofed data contained within the ASVspoof2019 LA dataset.
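
A minimal sketch of what Stage 3 amounts to: resume the checkpoint from the earlier stages and continue training at a small learning rate on in-domain data. The checkpoint path, dataloader, and label convention are assumptions; in the repository this is driven by trainSASVNet.py via the --initial_model flag.

import torch
import torch.nn as nn

def finetune(model: nn.Module, checkpoint_path: str, dataloader, epochs: int = 5):
    """Resume pre-trained weights and fine-tune on in-domain data (sketch)."""
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state, strict=False)            # key layout depends on the trainer
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)   # small LR: the in-domain set is limited
    for _ in range(epochs):
        for feats, labels in dataloader:                  # bona fide + spoofed ASVspoof2019 LA data
            loss = nn.functional.cross_entropy(model(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()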

Summary. Experimental results and pre-trained weights for several models.

All numbers are SASV-EER (%). bna = bona fide, spf = spoofed, cs = copy-synthesized; train / train+dev denote the ASVspoof2019 LA partitions used.

 #   Stage 1                 Stage 2                  Stage 3                   SKA-TDNN           MFA-Conformer
     (ASV-based              (Copy-synthesis          (In-domain                train   train+dev  train   train+dev
     Pre-training)           Training)                Fine-tuning)
 1   -                       -                        ASVspoof2019 (bna+spf)    9.55    5.94       11.47   7.67
 2   VoxCeleb2 (bna)         -                        -                         -       16.74      -       20.22
 3   VoxCeleb2 (bna)         -                        ASVspoof2019 (bna+spf)    2.67    1.25       2.13    1.51
 4   -                       VoxCeleb2 (bna+cs)       -                         -       13.11      -       14.27
 5   -                       VoxCeleb2 (bna+cs)       ASVspoof2019 (bna+spf)    2.47    1.93       1.91    1.35
 6   VoxCeleb2 (bna)         VoxCeleb2 (bna+cs)       -                         -       10.24      -       12.33
 7   VoxCeleb2 (bna)         VoxCeleb2 (bna+cs)       ASVspoof2019 (bna+spf)    1.83    1.56       1.19    1.06
 8   -                       ASVspoof2019 (bna+cs)    -                         13.10   10.49      13.63   12.48
 9   -                       ASVspoof2019 (bna+cs)    ASVspoof2019 (bna+spf)    9.57    6.17       13.46   10.11
10   VoxCeleb2 (bna)         ASVspoof2019 (bna+cs)    -                         5.62    4.93       9.31    8.32
11   VoxCeleb2 (bna)         ASVspoof2019 (bna+cs)    ASVspoof2019 (bna+spf)    2.48    1.44       2.72    1.76

You can download each pre-trained weight from the links in the table above.

Citation

If you use this repository, please cite the following papers:

@inproceedings{chung2020in,
  title={In defence of metric learning for speaker recognition},
  author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle={Proc. Interspeech},
  year={2020}
}
@inproceedings{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  booktitle={Proc. Interspeech},
  year={2022}
}
@inproceedings{mun2022frequency,
  title={Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification},
  author={Mun, Sung Hwan and Jung, Jee-weon and Han, Min Hyun and Kim, Nam Soo},
  booktitle={Proc. IEEE SLT},
  year={2022}
}