SASV2_Baseline

SASV2 baseline for the SASV track of the ASVspoof 5 Phase 2 challenge


Towards single integrated spoofing-aware speaker verification embeddings

Getting started. Installing dependencies.

A requirements.txt is included in each stage directory (all of them are identical, so you can use any one).

If you use an Anaconda virtual environment, create and activate it:

conda create -n sasv python=3.9 cudatoolkit=11.3
conda activate sasv

Then install all dependency packages:

pip3 install -r requirements.txt
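
As a quick sanity check after installation, you can confirm that PyTorch sees the CUDA toolkit (a minimal snippet, assuming PyTorch is among the pinned requirements):

import torch

# Verify the installed build and that a CUDA device is visible.
print(torch.__version__)          # should report a build compatible with cudatoolkit 11.3
print(torch.cuda.is_available())  # True on a correctly configured GPU machine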

Stage 1. Speaker classification-based Pre-training.

In Stage 1, the model learns to discriminate between target and bona fide non-target speakers using the VoxCeleb2 database, which contains data collected from thousands of bona fide speakers. This repository provides pre-trained weights for the following models:

Model          Params  SASV-EER (%)  SV-EER (%)  SPF-EER (%)
ECAPA-TDNN     16.7M   20.66         0.74        27.30
MFA-Conformer  20.9M   20.22         0.41        26.52
SKA-TDNN       29.4M   16.74         0.38        22.38
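
The three metrics are all equal error rates over different trial subsets: SV-EER uses bona fide non-target trials, SPF-EER uses spoofed trials, and SASV-EER pools both as non-target. As a reference, here is a minimal sketch of computing an EER from raw trial scores (plain numpy; the array names are illustrative, not the repository's evaluation code):

import numpy as np

def compute_eer(scores, labels):
    """EER from similarity scores; labels: 1 = target trial, 0 = non-target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)[np.argsort(scores)]  # sort by ascending score
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    # Sweep the decision threshold over the sorted scores.
    fr = np.cumsum(labels) / n_tar              # false rejection rate at each threshold
    fa = 1.0 - np.cumsum(1 - labels) / n_non    # false acceptance rate at each threshold
    idx = np.argmin(np.abs(fr - fa))            # point where the two rates cross
    return (fr[idx] + fa[idx]) / 2 * 100        # EER in percent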

You can evaluate the pre-trained weights using the following commands:

cd stage3

python trainSASVNet.py \
        --eval \
        --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
        --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
        --model ECAPA_TDNN \
        --initial_model /path/to/weight/ecapa_tdnn.model

python trainSASVNet.py \
        --eval \
        --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
        --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
        --model MFA_Conformer \
        --initial_model /path/to/weight/mfa_conformer.model

python trainSASVNet.py \
        --eval \
        --test_list ./protocols/ASVspoof2019.LA.asv.eval.gi.trl.txt \
        --test_path /path/to/dataset/ASVSpoof/ASVSpoof2019/LA/ASVspoof2019_LA_eval/wav \
        --model SKA_TDNN \
        --initial_model /path/to/weight/ska_tdnn.model
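
Internally, each trial in the protocol file is typically scored as the cosine similarity between the embeddings of the enrolment and test utterances. A minimal sketch of that scoring step (the model object and its call signature are hypothetical stand-ins, not the script's exact interface):

import torch
import torch.nn.functional as F

def score_trial(model, enrol_wav, test_wav):
    """Cosine similarity between two utterance embeddings (hypothetical interface)."""
    model.eval()
    with torch.no_grad():
        enrol_emb = model(enrol_wav)   # (1, D) embedding of the enrolment utterance
        test_emb = model(test_wav)     # (1, D) embedding of the test utterance
    # Higher scores mean the trial looks like a bona fide target speaker.
    return F.cosine_similarity(enrol_emb, test_emb).item()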

Stage 2. Copy-synthesis Training.

In Stage 2, we add the ability to discriminate between bona fide and spoofed inputs by training on large-scale data generated with an oracle speech synthesis system, referred to as copy synthesis. This repository supports copy-synthesis training with copy-synthesized data from the VoxCeleb2 dev set or from the ASVspoof2019 LA train or train+dev sets.
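
Copy synthesis re-generates a bona fide waveform from its own acoustic features, so the output keeps the speaker and content but carries synthesis artefacts, yielding spoof-like training data without a full TTS or VC pipeline. A rough illustration of the analysis-synthesis loop, using Griffin-Lim as a stand-in for the vocoders actually used (librosa and soundfile; the file names are placeholders):

import librosa
import soundfile as sf

# Analysis: extract a mel spectrogram from a bona fide utterance.
wav, sr = librosa.load("bonafide.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=512, hop_length=128)

# Synthesis: invert the features back to a waveform. Griffin-Lim is used here
# only for illustration; real copy synthesis uses proper vocoders, which leave
# different artefacts for the model to learn.
resynth = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=512, hop_length=128)
sf.write("copy_synthesized.wav", resynth, sr)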

Stage 3. In-domain Fine-tuning.

Although training in Stages 1 and 2 teaches the model to discriminate bona fide non-target and spoofed non-target inputs, a domain mismatch with the evaluation protocol remains. Furthermore, artefacts from the acoustic model have yet to be learned. Hence, in Stage 3, we fine-tune the model using the in-domain bona fide and spoofed data contained in the ASVspoof2019 LA dataset.
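
In code terms, Stage 3 is simply continued training from the Stage 1/2 checkpoint at a reduced learning rate on in-domain data. A schematic sketch with toy placeholders (the model, loss, and data below are dummies, not the repository's SASVNet or its loaders):

import torch
import torch.nn as nn

# Dummy stand-ins: a real run uses the repository's SASVNet model and the
# ASVspoof2019 LA data loader; everything here is a placeholder.
model = nn.Sequential(nn.Linear(80, 192), nn.ReLU(), nn.Linear(192, 2))
criterion = nn.CrossEntropyLoss()

# Start from the previous stage's weights (path is a placeholder), then
# fine-tune with a learning rate smaller than in pre-training.
# model.load_state_dict(torch.load("/path/to/weight/stage2.model"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

model.train()
for step in range(3):                      # toy loop standing in for epochs over LA data
    feats = torch.randn(8, 80)             # placeholder acoustic features
    labels = torch.randint(0, 2, (8,))     # bona fide target vs non-target/spoof
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    optimizer.step()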

Summary. Experimental results and pre-trained weights for several models.

No.  Stage 1                 Stage 2                 Stage 3                  SASV-EER (%)
     (ASV-based              (Copy-synthesis         (In-domain               SKA-TDNN          MFA-Conformer
     Pre-training)           Training)               Fine-tuning)             train  train+dev  train  train+dev
1    -                       -                       ASVspoof2019 (bna+spf)   9.55   5.94       11.47  7.67
2    VoxCeleb2 (bna)         -                       -                        -      16.74      -      20.22
3    VoxCeleb2 (bna)         -                       ASVspoof2019 (bna+spf)   2.67   1.25       2.13   1.51
4    -                       VoxCeleb2 (bna+cs)      -                        -      13.11      -      14.27
5    -                       VoxCeleb2 (bna+cs)      ASVspoof2019 (bna+spf)   2.47   1.93       1.91   1.35
6    VoxCeleb2 (bna)         VoxCeleb2 (bna+cs)      -                        -      10.24      -      12.33
7    VoxCeleb2 (bna)         VoxCeleb2 (bna+cs)      ASVspoof2019 (bna+spf)   1.83   1.56       1.19   1.06
8    -                       ASVspoof2019 (bna+cs)   -                        13.10  10.49      13.63  12.48
9    -                       ASVspoof2019 (bna+cs)   ASVspoof2019 (bna+spf)   9.57   6.17       13.46  10.11
10   VoxCeleb2 (bna)         ASVspoof2019 (bna+cs)   -                        5.62   4.93       9.31   8.32
11   VoxCeleb2 (bna)         ASVspoof2019 (bna+cs)   ASVspoof2019 (bna+spf)   2.48   1.44       2.72   1.76

(bna = bona fide, spf = spoofed, cs = copy-synthesized; train / train+dev denotes the ASVspoof2019 LA partition used for training.)

You can download each pre-trained weight from the links above.

Citation

If you use this repository, please cite the following papers:

@inproceedings{chung2020in,
  title={In defence of metric learning for speaker recognition},
  author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle={Proc. Interspeech},
  year={2020}
}
@inproceedings{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  booktitle={Proc. Interspeech},
  year={2022}
}
@inproceedings{mun2022frequency,
  title={Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification},
  author={Mun, Sung Hwan and Jung, Jee-weon and Han, Min Hyun and Kim, Nam Soo},
  booktitle={Proc. IEEE SLT},
  year={2022}
}