# Awesome Speech Enhancement

This is a curated list of awesome speech enhancement tutorials, papers, libraries, datasets, tools, scripts and results. The purpose of this repo is to organize the world's resources for speech enhancement and make them universally accessible and useful.

To add items to this page, simply send a pull request. (contributing guide)
| Link | Language | Description |
| --- | --- | --- |
| SETK | Python & C++ | SETK: Speech Enhancement Tools integrated with Kaldi. |
| pyAudioAnalysis | Python | Python audio analysis library: feature extraction, classification, segmentation and applications. |
| Beamformer | Python | Implementation of mask-based adaptive beamformers (MVDR, GEVD, MCWF). |
| Time-frequency mask | Python | Computation of time-frequency masks (PSM, IRM, IBM, IAM, ...) used as neural-network training labels. |
| SSL | Python | Implementation of sound source localization. |
| Data format | Python | Format conversion between Kaldi, NumPy and MATLAB. |
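Several of the tools above produce time-frequency masks such as the IRM and IBM as training targets. As a hedged illustration (a minimal numpy/scipy sketch, not the actual SETK implementation), both masks can be computed from the STFTs of the clean and noise signals:

```python
# Minimal sketch of IRM/IBM computation from clean speech and noise.
# The random signals here are stand-ins for real recordings.
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)          # stand-in for a clean utterance
noise = 0.5 * rng.standard_normal(fs)    # stand-in for additive noise

_, _, S = stft(clean, fs=fs, nperseg=512)   # clean spectrogram
_, _, N = stft(noise, fs=fs, nperseg=512)   # noise spectrogram

# Ideal Ratio Mask: speech energy over total energy per TF bin, in [0, 1].
irm = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)

# Ideal Binary Mask: 1 where speech dominates noise (0 dB local SNR
# threshold), else 0.
ibm = (np.abs(S) > np.abs(N)).astype(np.float32)
```

Multiplying either mask with the noisy spectrogram and inverting the STFT gives an oracle-enhanced signal, which is the usual upper-bound baseline in masking papers.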
| Link | Language | Description |
| --- | --- | --- |
| PESQ etc. | Matlab | Evaluation of PESQ, CSIG, CBAK, COVL and STOI. |
| SNR, LSD | Python | Evaluation of signal-to-noise ratio and log-spectral distance. |
| SDR | Matlab | Evaluation of signal-to-distortion ratio. |
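Two of the simpler metrics above, SNR and LSD, can be sketched in plain numpy (a hedged illustration of the common definitions, not the code behind the linked tools):

```python
# SNR in dB and log-spectral distance (LSD), numpy-only sketches.
import numpy as np

def snr_db(clean, estimate):
    """Signal-to-noise ratio in dB between a clean reference and an estimate."""
    noise = clean - estimate
    return 10 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def lsd(ref_spec, est_spec, eps=1e-12):
    """Log-spectral distance between two magnitude spectrograms (freq x time):
    RMS of the log-power difference over frequency, averaged over frames."""
    log_diff = 10 * np.log10((ref_spec ** 2 + eps) / (est_spec ** 2 + eps))
    return np.mean(np.sqrt(np.mean(log_diff ** 2, axis=0)))

# Example: a 440 Hz tone corrupted by light white noise.
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(16000)
```

Note that PESQ, CSIG, CBAK and COVL are far more involved (perceptual models and regressions), which is why dedicated implementations such as the linked MATLAB code are normally used for them.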
## Audio feature extraction

| Link | Language | Description |
| --- | --- | --- |
| LPS | Python | Extraction of log-power spectrum, magnitude spectrum and log-magnitude spectrum, plus cepstral mean and variance normalization. |
| MFCC | Python | This library provides common speech features for ASR, including MFCCs and filterbank energies. |
| pyAudioAnalysis | Python | Python audio analysis library: feature extraction, classification, segmentation and applications. |
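The LPS pipeline above (log-power spectrum followed by mean/variance normalization) can be sketched in a few lines of numpy/scipy; this is an illustrative sketch of the standard recipe, not the linked tool's code:

```python
# Log-power-spectrum features with per-frequency mean/variance normalization.
import numpy as np
from scipy.signal import stft

def log_power_spectrum(wave, fs=16000, nperseg=512):
    """STFT, then log of the power spectrum; returns (freq, time)."""
    _, _, Z = stft(wave, fs=fs, nperseg=nperseg)
    return np.log(np.abs(Z) ** 2 + 1e-12)

def mvn(feats):
    """Per-frequency mean/variance normalization over the time axis,
    as used for CMVN-style feature normalization."""
    mu = feats.mean(axis=1, keepdims=True)
    sigma = feats.std(axis=1, keepdims=True) + 1e-12
    return (feats - mu) / sigma

wave = np.random.default_rng(0).standard_normal(16000)  # stand-in waveform
lps = mvn(log_power_spectrum(wave))
```

Enhancement networks trained on LPS features typically predict either the clean LPS directly or a mask applied to the noisy one.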
| Link | Language | Description |
| --- | --- | --- |
| Data simulation | Python | Add reverberation, add noise, or mix speakers. |
| RIR simulation | Python | Generation of room impulse responses (RIRs) using the image method. |
| pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing in indoor applications. |
| gpuRIR | Python | Python library for room impulse response (RIR) simulation with GPU acceleration. |
| rir_simulator_python | Python | Room impulse response simulator written in Python. |
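The core of noise-based data simulation is mixing clean speech with noise at a requested SNR. A minimal sketch, assuming numpy only (the linked tools add reverberation and multi-speaker mixing on top of this):

```python
# Mix clean speech with noise at a target SNR by rescaling the noise.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so the mixture has the given SNR in dB."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale so that clean_power / (scale**2 * noise_power) == 10**(snr_db/10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in speech
noisy = mix_at_snr(clean, rng.standard_normal(16000), snr_db=5.0)
```

Training sets are usually built by sweeping `snr_db` over a range (e.g. 0 to 15 dB) and sampling noise segments from banks such as those listed below.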
## Speech enhancement datasets (sorted by usage frequency in papers)

| Name | Utterances | Speakers | Language | Pricing | Additional information |
| --- | --- | --- | --- | --- | --- |
| Dataset by University of Edinburgh (2016) | 35K+ | 86 | English | Free | Noisy speech database for training speech enhancement algorithms and TTS models. |
| TIMIT (1993) | 6K+ | 630 | English | $250.00 | The TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
| VCTK (2009) | 43K+ | 109 | English | Free | Most texts were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. |
| WSJ0 (1993) | -- | 149 | English | $1500 | The WSJ database was generated from a machine-readable corpus of Wall Street Journal news text. |
| LibriSpeech (2015) | 292K | 2K+ | English | Free | Large-scale (1000 hours) corpus of read English speech. |
| CHiME series (~2020) | -- | -- | English | Free | Published as part of the CHiME Speech Separation and Recognition Challenge. |
## Augmentation noise sources (sorted by usage frequency in papers)

| Name | Noise types | Pricing | Additional information |
| --- | --- | --- | --- |
| DEMAND (2013) | 18 | Free | The Diverse Environments Multichannel Acoustic Noise Database provides recordings for testing algorithms against real-world noise in a variety of settings. |
| 115 Noise (2015) | 115 | Free | A noise bank for simulating noisy data from clean speech. Noises N1-N100 were collected by Guoning Hu; the other 15 home-made noise types were recorded by USTC. |
| NoiseX-92 (1996) | 15 | Free | Database of recordings of various noises, available on 2 CD-ROMs. |
## SOTA results on the dataset by University of Edinburgh

The following methods are all trained on "trainset_28spk" and evaluated on the common test set. ("F" denotes frequency domain; "T" denotes time domain.)

| Methods | Publish | Domain | PESQ | CSIG | CBAK | COVL | SegSNR | STOI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Noisy | -- | -- | 1.97 | 3.35 | 2.44 | 2.63 | 1.68 | 0.91 |
| Wiener | -- | -- | 2.22 | 3.23 | 2.68 | 2.67 | 5.07 | -- |
| SEGAN | INTERSPEECH 2017 | T | 2.16 | 3.48 | 2.94 | 2.80 | 7.73 | 0.93 |
| CNN-GAN | APSIPA 2018 | F | 2.34 | 3.55 | 2.95 | 2.92 | -- | 0.93 |
| WaveUnet | arXiv 2018 | T | 2.40 | 3.52 | 3.24 | 2.96 | 9.97 | -- |
| WaveNet | ICASSP 2018 | T | -- | 3.62 | 3.24 | 2.98 | -- | -- |
| U-net | ISMIR 2017 | F | 2.48 | 3.65 | 3.21 | 3.05 | 9.34 | -- |
| MSE-GAN | ICASSP 2018 | F | 2.53 | 3.80 | 3.12 | 3.14 | -- | 0.93 |
| DFL | INTERSPEECH 2019 | T | -- | 3.86 | 3.33 | 3.22 | -- | -- |
| DFL (reimplemented) | ICLR 2019 | T | 2.51 | 3.79 | 3.27 | 3.14 | 9.86 | -- |
| TasNet | TASLP 2019 | T | 2.57 | 3.80 | 3.29 | 3.18 | 9.65 | -- |
| MDPhD | arXiv 2018 | T&F | 2.70 | 3.85 | 3.39 | 3.27 | 10.22 | -- |
| Complex U-net | INTERSPEECH 2019 | F | 3.24 | 4.34 | 4.10 | 3.81 | 16.85 | -- |
| Complex U-net (reimplemented) | arXiv 2019 | F | 2.87 | 4.12 | 3.47 | 3.51 | 9.96 | -- |
| SDR-PESQ | arXiv 2019 | F | 3.01 | 4.09 | 3.54 | 3.55 | 10.44 | -- |
| RHRnet | ICASSP 2020 | T | 3.20 | 4.37 | 4.02 | 3.82 | 14.71 | 0.98 |
## Learning materials

- A Study on WaveNet, GANs and General CNN-RNN Architectures, 2019 [link]
- Deep Learning: Methods and Applications, 2016 [link]
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, 2016 [link]
- Robust Automatic Speech Recognition by Jinyu Li and Li Deng, 2015 [link]
- CCF speech seminar, 2020 [link]
- Real-time Single-channel Speech Enhancement with Recurrent Neural Networks by Microsoft Research, 2019 [link]
- Deep learning in speech by Hongyi Li, 2019 [link]
- High-Accuracy Neural-Network Models for Speech Enhancement, 2017 [link]
- DNN-Based Online Speech Enhancement Using Multitask Learning and Suppression Rule Estimation, 2015 [link]
- Microphone array signal processing: beyond the beamformer, 2011 [link]
- Learning-based approach to speech enhancement and separation (INTERSPEECH tutorial, 2016) [link]
- Deep learning for speech/language processing (INTERSPEECH tutorial by Li Deng, 2015) [link]
- Speech enhancement algorithms (Stanford University, 2013) [link]