Audio-Deepfake-Detection

Research progress on speech deepfake detection: relevant datasets and publicly available code, aggregated from the survey literature

This repository contains a curated list of audio deepfake resources, accompanied by a survey report on Audio Deepfake Detection (ADD). The sections on ADD Datasets, Audio Preprocessing, Feature Extraction and Network Training point beginners to carefully selected material for learning the ADD domain. We will endeavour to maintain this repository on an ongoing basis for a fixed period.

Audio Large Model

| Model | Publisher | Date | Achievable Tasks |
| --- | --- | --- | --- |
| AudioLM (Paper, Website, Code) | Google | 2022.09 | 1. Maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. 2. Speech continuation, acoustic generation, unconditional generation, generation without semantic tokens, and piano continuation. |
| VALL-E (Paper, Website) | Microsoft | 2023.01 | 1. Creates high-quality personalised speech from a 3-second enrolled recording of an unseen speaker. 2. VALL-E X: cross-lingual speech synthesis. |
| USM (Website) | Google | 2023.03 | 1. ASR beyond 100 languages. 2. Downstream ASR tasks. 3. Automated Speech Translation (AST). |
| SpeechGPT (Website) | Fudan University | 2023.05 | 1. Perceives and generates multi-modal content. 2. Spoken dialogue LLM that follows human instructions. |
| Pengi (Paper, Website) | Microsoft | 2023.05 | 1. An Audio Language Model that leverages transfer learning by framing all audio tasks as text-generation tasks. 2. Its unified architecture enables open-ended and close-ended tasks without any additional fine-tuning or task-specific extensions. |
| VoiceBox (Website) | Meta | 2023.06 | 1. Synthesizes speech across six languages. 2. Removes transient noise. 3. Edits content. 4. Transfers audio style within and across languages. 5. Generates diverse speech samples. |
| AudioPaLM (Paper, Website) | Google | 2023.06 | 1. Speech-to-speech translation. 2. Automatic Speech Recognition (ASR). |

Datasets

| Attack Types | Year | Dataset | Number of Audio (Subdataset: Real/Fake) | Language |
| --- | --- | --- | --- | --- |
| TTS | 2021 | WaveFake (Paper, Dataset) | 16283/117985 | English, Japanese |
| TTS | 2021 | HAD (Paper) | 53612/107224 | Chinese |
| TTS | 2022 | ADD 2022 (Paper) | LF: 5619/46067; PF: 5319/46419; FG-D: 5319/46206 | Chinese |
| TTS | 2022 | CMFD (Paper, Dataset) | Chinese: 1800/1000; English: 1800/1000 | English, Chinese |
| TTS | 2022 | In-the-Wild (Paper, Dataset) | 19963/11816 | English |
| TTS | 2022 | FAD (Paper, Dataset) | 115800/115800 | Chinese |
| Replay | 2017 | ASVspoof 2017 (Paper, Dataset) | 3565/14465 | English |
| Replay | 2019 | ReMASC (Paper, Dataset) | 9240/45472 | English, Chinese, Hindi |
| TTS and VC | 2015 | AVspoof (Paper, Dataset) | LA: 15504/120480; PA: 15504/14465 | English |
| TTS and VC | 2015 | ASVspoof 2015 (Paper, Dataset) | 16651/246500 | English |
| TTS and VC | 2021 | FMFCC-A (Paper, Dataset) | 10000/40000 | Chinese |
| TTS and VC | 2022 | SceneFake (Paper, Dataset) | 19838/64642 | English |
| TTS and VC | 2022 | EmoFake (Paper) | 35000/53200 | English, Chinese |
| TTS and VC | 2023 | PartialSpoof (Paper, Dataset) | 12483/108978 | English |
| TTS and VC | 2023 | ADD 2023 (Paper) | FG-D: 172819/113042; RL: 55468/65449; AR: 14907/95383 | Chinese |
| TTS and VC | 2023 | DECRO (Paper, Dataset) | Chinese: 21218/41880; English: 12484/42799 | English, Chinese |
| TTS, VC and Replay | 2019 | ASVspoof 2019 (Paper, Dataset) | LA: 12483/108978; PA: 28890/189540 | English |
| TTS, VC and Replay | 2021 | ASVspoof 2021 (Paper, Dataset) | LA: 18452/163114; PA: 126630/816480; PF: 14869/519059 | English |
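
The Real/Fake counts above show that most of these corpora are heavily imbalanced, which is one reason weighted cross-entropy (WCE) appears among the loss functions in the results tables of this repository. As a minimal sketch, the snippet below derives inverse-frequency class weights from a Real/Fake pair. The weighting rule `N / (2 * n_c)` and the function name `inverse_frequency_weights` are illustrative assumptions, not the scheme of any particular paper listed here.

```python
def inverse_frequency_weights(n_real: int, n_fake: int) -> dict:
    """Return per-class weights w_c = N / (2 * n_c).

    Assumption: a common inverse-frequency convention, so that both
    classes contribute equally to a weighted loss in expectation.
    """
    if n_real <= 0 or n_fake <= 0:
        raise ValueError("both class counts must be positive")
    total = n_real + n_fake
    return {
        "real": total / (2 * n_real),
        "fake": total / (2 * n_fake),
    }


# Real/Fake counts of the ASVspoof 2019 LA subset, as listed in the table above.
weights = inverse_frequency_weights(12483, 108978)
print(weights)
```

With these counts the minority (real) class receives the larger weight, and the products `weight * count` are equal for both classes.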

Audio Preprocessing

Commonly Used Noise Datasets

| Dataset | Description |
| --- | --- |
| MUSAN (Dataset) | A corpus of music, speech and noise. |
| RIR (Dataset) | A database of simulated and real room impulse responses, isotropic and point-source noises. All audio files are sampled at 16 kHz with 16-bit precision. |
| NOIZEUS (Dataset) | Contains 30 IEEE sentences (produced by three male and three female speakers) corrupted by eight different real-world noises at different SNRs. Noises include suburban train, babble, car, exhibition hall, restaurant, street, airport and train station noise. |
| NoiseX-92 (Dataset) | Each noise recording lasts 235 seconds and was acquired at a sampling rate of 19.98 kHz using a 16-bit analogue-to-digital (A/D) converter with an anti-aliasing filter and no pre-emphasis stage. Fifteen noise types are included. |
| DEMAND (Dataset) | Multi-channel acoustic noise database for diverse environments. |
| ESC-50 (Dataset) | A labelled collection of 2000 environmental audio recordings drawn from Freesound.org clips, suitable for environmental sound classification. The 5-second-long recordings are organised into 5 broad categories, each with 10 subcategories (40 examples per subcategory). |
| ESC (Dataset) | Includes ESC-50, ESC-10 and ESC-US. |
| FSD50K (Dataset) | An open dataset of human-labelled sound events containing 51,197 Freesound clips totalling 108.3 hours of multi-labelled audio, unequally distributed across 200 classes from the AudioSet Ontology. |
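
Noise corpora such as these are typically used by mixing a noise clip into clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows one way to do that with plain Python lists so it runs without extra dependencies; in practice an array library would be used. The function name `mix_at_snr` and the choice to tile or trim the noise to the speech length are assumptions made for illustration.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Assumption: the noise is tiled (or trimmed) to match the speech length.
    """
    speech = list(speech)
    noise = list(noise)
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Repeat the noise enough times to cover the speech, then trim.
    repeats = -(-len(speech) // len(noise))  # ceiling division
    noise = (noise * repeats)[: len(speech)]

    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise clip is silent")

    # Choose the gain so that speech_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

A lower `snr_db` yields a noisier mixture; for example, `mix_at_snr(speech, noise, 15.0)` produces a mixture whose speech power is about 31.6 times the power of the added noise.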

Audio Augmentation Methods

| Method | Description |
| --- | --- |
| SpecAugment (Paper, Code) | Augmentation strategies include time warping, frequency masking and time masking. |
| WavAugment (Paper, Code) | Augmentation strategies include pitch randomization, reverberation, additive noise, time dropout (temporal masking), band reject and clipping. |
| RawBoost (Paper, Code) | Augmentation strategies include linear and non-linear convolutive noise, impulsive signal-dependent additive noise and stationary signal-independent additive noise. |
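
To make the masking strategies listed for SpecAugment concrete, here is a minimal sketch of frequency masking and time masking on a spectrogram stored as a list of rows (frequency bins by time frames). Time warping is omitted. The function name `mask_spectrogram`, its parameters and the choice of zero as the fill value are assumptions for illustration; this is not the reference implementation linked above.

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng=None):
    """Return a copy of `spec` with one frequency mask and one time mask applied.

    `spec` is a list of rows, one per frequency bin, each a list of frame values.
    Masked cells are set to 0.0 (an assumed fill value).
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency mask: blank out `f` consecutive frequency bins starting at `f0`.
    f = rng.randint(0, min(max_freq_width, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time mask: blank out `t` consecutive frames starting at `t0`.
    t = rng.randint(0, min(max_time_width, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        for j in range(t0, t0 + t):
            row[j] = 0.0

    return out
```

Passing a seeded `random.Random` instance as `rng` makes the masks reproducible, which can help when debugging an augmentation pipeline.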

Feature Extraction

Handcrafted Feature-based Forgery Detection

Paper Audio Deepfake Detection Results
Data Augmentation Feature Extraction Network Framework Loss Function EER (%) t-DCF
Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge
Paper Code
CQT, Power Spectrum VGG, SincNet CE LA: 8.01 (4)
PA: 1.51 (2)
LA: 0.208 (4)
PA: 0.037 (1)
Long-term high frequency features for synthetic speech detection
Paper
Cafe, White and Street Noise ICQC, ICQCC, ICBC, ICLBC DNN CE LA: 7.78 (3) LA: 0.187 (3)
Voice spoofing countermeasure for logical access attacks detection
Paper
ELTP-LFCC DBiLSTM LA: 0.74 (1) LA: 0.008 (1)
Voice spoofing detector: A unified anti-spoofing framework
Paper
ATP-GTCC SVM Hamming Distance
LA: 0.75 (2)
PA: 1.00 (1)
LA: 0.050 (2)
PA: 0.064 (2)
Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA or PA scenario, and bolded values are the best results for that scenario.
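
The tables in this repository report the equal error rate (EER), the operating point at which the false acceptance rate and the false rejection rate coincide. As a rough sketch of how an EER can be obtained from detection scores, the snippet below sweeps every observed score as a threshold and picks the one where the two rates are closest. It assumes that a higher score means "more likely bona fide"; the function name `compute_eer` is hypothetical, and the quadratic-time sweep is chosen for readability rather than speed. The t-DCF metric is not covered, since it additionally depends on the error rates of a speaker verification system and on cost parameters.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), with eer as a fraction in [0, 1].

    Assumption: higher scores indicate bona fide speech, and a trial is
    accepted when its score is greater than or equal to the threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    best = None
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)

    return best[1], best[2]


eer, threshold = compute_eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.7, 0.2])
print(f"EER = {100 * eer:.2f}% at threshold {threshold}")
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the EER (%) columns.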

Hybrid Feature-based Forgery Detection

Paper Audio Deepfake Detection Results
Data Augmentation Feature Extraction Network Structure Loss Function EER (%) t-DCF
Light convolutional neural network with feature genuinization for detection of synthetic speech attacks
Paper
CQT-based LPS LCNN LA: 4.07 (11) LA: 0.102 (10)
Siamese convolutional neural network using gaussian probability feature for spoofing speech detection
Paper
LFCC Siamese CNN CE LA: 3.79 (10)
PA: 7.98 (5)
LA: 0.093 (5)
PA: 0.195 (2)
Generalization of audio deepfake detection
Paper
RIR and MUSAN LFB ResNet18 LCML LA: 1.81 (4) LA: 0.052 (4)
Continual learning for fake audio detection
Paper
LFCC LCNN, DFWF Similarity Loss LA: 7.74 (15)
PA: 8.85 (6)
Partially-connected differentiable architecture search for deepfake and spoofing detection
Paper Code
Frequency Mask LFCC PC-DARTS WCE LA: 4.96 (12) LA: 0.091 (8)
One-class learning towards synthetic voice spoofing detection
Paper Code
LFCC ResNet18 OC-Softmax LA: 2.19 (7) LA: 0.059 (5)
Replay and synthetic speech detection with res2net architecture
Paper Code
CQT SE-Res2Net50 BCE LA: 2.50 (8)
PA: 0.46 (2)
LA: 0.074 (7)
PA: 0.012 (2)
An empirical study on channel effects for synthetic voice spoofing countermeasure systems
Paper Code
Telephone Codecs, and Device/Room Impulse Responses (IRs). LFCC LCNN, ResNet-OC OC-Softmax, CE LA: 3.92 (10)
Efficient attention branch network with combined loss function for automatic speaker verification spoof detection
Paper Code
SpecAug, Attention Mask LFCC EfficientNet-A0, SE-Res2Net50 WCE, Triplet Loss LA: 1.89 (6)
PA: 0.86 (4)
LA: 0.507 (11)
PA: 0.024 (4)
Resmax: Detecting voice spoofing attacks with residual network and max feature map
Paper
CQT ResMax BCE LA: 2.19 (7)
PA: 0.37 (1)
LA: 0.060 (6)
PA: 0.009 (1)
Synthetic voice detection and audio splicing detection using se-res2net-conformer architecture
Paper
Adding noise according to a signal-to-noise ratio of 15 dB or 25 dB CQT SE-Res2Net34-Conformer CE LA: 1.85 (5) LA: 0.060 (6)
Fastaudio: A learnable audio front-end for spoof speech detection
Paper Code
L-VQT L-DenseNet NLLLoss LA: 1.54 (3) LA: 0.045 (3)
Learning from yourself: A self-distillation method for fake speech detection
Paper
LPS, F0 ECANet, SENet A-Softmax LA: 1.00 (2)
PA: 0.65 (3)
LA: 0.031 (2)
PA: 0.017 (3)
How to boost anti-spoofing with x-vectors
Paper
LFCC, MFCC TDNN, SENet34 LCML LA: 0.83 (1) LA: 0.024 (1)
Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA or PA scenario, and bolded values are the best results for that scenario.

End-to-end Forgery Detection

Paper Audio Deepfake Detection Results
Data Augmentation Feature Extraction Network Structure Loss Function EER (%) t-DCF
A light convolutional GRU-RNN deep feature extractor for asv spoofing detection
Paper
LC-GRNN PLDA LA: 6.28 (13)
PA: 2.23
LA: 0.152 (10)
PA: 0.061
Rw-resnet: A novel speech anti-spoofing model using raw waveform
Paper
1D Convolution Residual Block ResNet CE LA: 2.98 (11) LA: 0.082 (9)
Raw differentiable architecture search for speech deepfake and spoofing detection
Paper Code
Masking Filter Sinc Filter PC-DARTS P2SGrad LA: 1.77 (10) LA: 0.052 (7)
Towards end-to-end synthetic speech detection
Paper Code
DNN Res-TSSDNet, Inc-TSSDNet WCE LA: 1.64 (9) LA: 0.048 (6)
End-to-end anti-spoofing with RawNet2
Paper Code
Sinc Filter RawNet2 CE LA: 1.12 (5) LA: 0.033 (3)
Long-term variable Q transform: A novel time-frequency transform algorithm for synthetic speech detection
Paper
FastAudio filter X-vector, ECAPA-TDNN LA: 1.54 (7) LA: 0.045 (5)
Fully automated end-to-end fake audio detection
Paper
Sinc Filter Wav2Vec2 light-DARTS Comparative loss LA: 1.08 (4)
Audio anti-spoofing using a simple attention module and joint optimization based on additive angular margin loss and meta-learning
Paper
Sinc Filter RawNet2, SimAM AAM Softmax, MSE LA: 0.99 (3) LA: 0.029 (2)
AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks
Paper Code
Sinc Filter RawNet2, MGO, HS-GAL CE LA: 0.83 (2) LA: 0.028 (1)
Ai-synthesized voice detection using neural vocoder artifacts
Paper Code
Resampling, Noise Addition Sinc Filter RawNet2 CE, Softmax LA: 4.54 (12)
To-RawNet: Improving rawnet with tcn and orthogonal regularization for fake audio detection
Paper
RawBoost Sinc Filter RawNet2, TCN CE, Orthogonal Loss LA: 1.58 (8)
Speaker-Aware Anti-spoofing
Paper
Sinc Filter AASIST, M2S Converter CE LA: 1.13 (6) LA: 0.038 (4)
Spoofing attacker also benefits from self-supervised pretrained model
Paper
HuBERT, WavLM Residual block, Conv-TasNet AAM softmax LA: 0.44 (1)
Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA scenario, and bolded values are the best results for that scenario.
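
Many of the end-to-end systems above list a "Sinc Filter" front-end. In SincNet-style front-ends, each filter is a band-pass filter defined by just two cutoff frequencies, built as the difference of two windowed sinc low-pass filters. The sketch below constructs one such filter with fixed cutoffs and a Hamming window; in the actual models the cutoffs are learnable parameters. The function names, the kernel size of 129 and the 16 kHz sample rate are assumptions for illustration.

```python
import math


def sinc_bandpass(f_low, f_high, kernel_size=129, sample_rate=16000):
    """Return the taps of a Hamming-windowed sinc band-pass filter (cutoffs in Hz)."""
    if not 0 <= f_low < f_high <= sample_rate / 2:
        raise ValueError("need 0 <= f_low < f_high <= Nyquist")
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be an odd number >= 3")

    def sinc(x):
        return 1.0 if x == 0 else math.sin(x) / x

    f1 = f_low / sample_rate   # normalised cutoffs in cycles per sample
    f2 = f_high / sample_rate
    half = kernel_size // 2
    taps = []
    for i in range(kernel_size):
        n = i - half
        # Band-pass = low-pass(f2) minus low-pass(f1).
        ideal = 2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        taps.append(ideal * window)
    return taps


def magnitude_response(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at `freq_hz`."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(h * math.cos(w * n) for n, h in enumerate(taps))
    im = sum(h * math.sin(w * n) for n, h in enumerate(taps))
    return math.hypot(re, im)
```

For example, `magnitude_response(sinc_bandpass(1000.0, 2000.0), 1500.0)` is close to 1, while the response at 5000 Hz is close to 0, which is the band-pass behaviour a learned filterbank of this kind relies on.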

Feature Fusion-based Forgery Detection

Paper Audio Deepfake Detection Results
Feature Extraction Network Structure Loss Function EER (%)
Voice spoofing countermeasure for synthetic speech detection
Paper
GTCC, MFCC, Spectral Flux, Spectral Centroid Bi-LSTM LA: 3.05 (4)
Combining automatic speaker verification and prosody analysis for synthetic speech detection
Paper
MFCC, Mel-Spectrogram ECAPA-TDNN, Prosody Encoder BCE LA: 5.39 (5)
Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation
Paper
Sinc Filter, Wav2Vec2 AASIST Contrastive Loss, WCE
Overlapped frequency-distributed network: Frequency-aware voice spoofing countermeasure
Paper
Mel-Spectrogram, CQT LCNN, ResNet LA: 1.35 (2)
PA: 0.35
Detection of cross-dataset fake audio based on prosodic and pronunciation features
Paper
Phoneme Feature, Prosody Feature, Wav2Vec2 LCNN, Bi-LSTM CTC LA: 1.58 (3)
Betray oneself: A novel audio deepfake detection model via mono-to-stereo conversion
Paper Code
Sinc Filter AASIST, M2S Converter CE LA: 1.34 (1)
Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA scenario, and bolded values are the best results for that scenario.

Network Training

Multi-task Learning-based Forgery Detection

Paper Audio Deepfake Detection Results
Feature Extraction Network Structure Loss Function EER (%) t-DCF
Multi-task learning in utterance-level and segmental-level spoof detection
Paper
LFCC SELCNN, Bi-LSTM P2SGrad
SA-SASV: An end-to-end spoof-aggregated spoofing-aware speaker verification system
Paper Code
Fbanks, Sinc Filter ECAPA-TDNN, ARawNet BCE, AAM Softmax, CE LA: 4.86 (4)
STATNet: Spectral and temporal features based multi-task network for audio spoofing detection
Paper
Sinc Filter RawNet2, TCM, SCM CE LA: 2.45 (3) LA: 0.062 (2)
A probabilistic fusion framework for spoofing aware speaker verification
Paper Code
Mel Filter, Sinc Filter ECAPA-TDNN, AASIST BCE LA: 1.53 (2)
DSVAE: Interpretable disentangled representation for synthetic speech detection
Paper
Spectrogram VAE KL Divergence Loss, BCE LA: 6.56 (5)
End-to-end dual-branch network towards synthetic speech detection
Paper Code
LFCC, CQT Dual-Branch Network Classification Loss, Fake Type Classification Loss LA: 0.80 (1) LA: 0.021 (1)
Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA scenario, and bolded values are the best results for that scenario.

Reference

For more details on the above, you may check the following papers:

Statement

The purpose of this project is to build a resource collection for audio deepfake detection, solely for communication and learning. All content collected in this project is sourced from journals and the internet, and we express sincere gratitude to the researchers and authors who have published the related research. In the event of a copyright infringement complaint, the content will be removed as appropriate.

Contact

We are glad to hear from you. If you have any questions, please feel free to contact xuyuxiong2022@email.szu.edu.cn.