A tutorial for Speech Enhancement researchers and practitioners. The purpose of this repo is to organize the world’s resources for speech enhancement and make them universally accessible and useful.

MIT License

Awesome Speech Enhancement

Overview

This is a curated list of awesome Speech Enhancement tutorials, papers, libraries, datasets, tools, scripts and results.

To add items to this page, simply send a pull request.

Publications

Coming soon...

Survey

  • A literature survey on single channel speech enhancement, 2020 [paper]
  • Supervised speech separation based on deep learning: An Overview, 2017 [paper]
  • A review on speech enhancement techniques, 2015 [paper]
  • Nonlinear speech enhancement: an overview, 2007 [paper]

Feature augmentation

  • Speech enhancement using self-adaptation and multi-head attention, ICASSP 2020 [paper]
  • PAN: phoneme-aware network for monaural speech enhancement, ICASSP 2020 [paper]
  • Noise tokens: learning neural noise templates for environment-aware speech enhancement [paper]
  • Speaker-aware deep denoising autoencoder with embedded speaker identity for speech enhancement, Interspeech 2019 [paper]

Network design

Filter design
  • Efficient trainable front-ends for neural speech enhancement, ICASSP 2020 [paper]
Fusion techniques
  • Masking and inpainting: a two-stage speech enhancement approach for low SNR and non-stationary noise, ICASSP 2020 [paper]
  • A composite DNN architecture for speech enhancement, ICASSP 2020 [paper]
  • Multi-domain processing via hybrid denoising networks for speech enhancement, 2018 [paper]
Attention
  • Speech enhancement using self-adaptation and multi-head attention, ICASSP 2020 [paper]
  • Channel-attention dense U-Net for multichannel speech enhancement, ICASSP 2020 [paper]
  • T-GSA: transformer with Gaussian-weighted self-attention for speech enhancement, ICASSP 2020 [paper]
U-net
  • Phase-aware speech enhancement with deep complex U-Net, ICLR 2019 [paper] [code]
GAN
  • PAGAN: a phase-adapted generative adversarial networks for speech enhancement, ICASSP 2020 [paper]
  • Time-frequency masking-based speech enhancement using generative adversarial network, ICASSP 2018 [paper]
  • SEGAN: speech enhancement generative adversarial network, Interspeech 2017 [paper]

Phase reconstruction

  • Phase reconstruction based on recurrent phase unwrapping with deep neural networks, ICASSP 2020 [paper]
  • PAGAN: a phase-adapted generative adversarial networks for speech enhancement, ICASSP 2020 [paper]
  • Invertible DNN-based nonlinear time-frequency transform for speech enhancement, ICASSP 2020 [paper]
  • Phase-aware speech enhancement with deep complex U-Net, ICLR 2019 [paper] [code]

Learning strategy

Loss function
  • Speech denoising with deep feature losses, Interspeech 2019 [paper]
  • End-to-end multi-task denoising for joint SDR and PESQ optimization, arXiv 2019 [paper]
Multi-task learning
Curriculum learning

Other improvements

  • Improving robustness of deep learning based monaural speech enhancement against processing artifacts, ICASSP 2020 [paper]

Tools

Framework

| Link | Language | Description |
| --- | --- | --- |
| SETK | Python & C++ | SETK: Speech Enhancement Tools integrated with Kaldi. |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
| Beamformer | Python | Implementation of the mask-based adaptive beamformer (MVDR, GEVD, MCWF). |
| Time-frequency Mask | Python | Computation of the time-frequency mask (PSM, IRM, IBM, IAM, ...) as the neural network training labels. |
| SSL | Python | Implementation of Sound Source Localization. |
| Data format | Python | Format transform between Kaldi, NumPy and Matlab. |
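
The time-frequency masks named in the table above (IBM, IRM, IAM, PSM) can be sketched as follows. This is a minimal illustration using common textbook definitions, not the code of the linked script; the helper names `stft` and `tf_masks`, the frame settings, and the 0 dB IBM threshold are all assumptions made for this example.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Hann-windowed STFT, shape (frames, frame_len // 2 + 1). Hypothetical helper."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def tf_masks(clean, noise, eps=1e-8):
    """Compute training-target masks from time-aligned clean speech and noise."""
    S, N = stft(clean), stft(noise)
    Y = S + N  # the STFT is linear, so this is the STFT of the noisy mixture
    s_pow, n_pow = np.abs(S) ** 2, np.abs(N) ** 2
    return {
        # Ideal binary mask: 1 where speech dominates noise (assumed 0 dB threshold)
        "IBM": (s_pow > n_pow).astype(np.float32),
        # Ideal ratio mask with the commonly used exponent 0.5
        "IRM": np.sqrt(s_pow / (s_pow + n_pow + eps)),
        # Ideal amplitude mask: |S| / |Y| (unbounded above)
        "IAM": np.abs(S) / (np.abs(Y) + eps),
        # Phase-sensitive mask: |S| / |Y| * cos(theta_S - theta_Y)
        "PSM": np.real(S * np.conj(Y)) / (np.abs(Y) ** 2 + eps),
    }
```

In practice the IAM and PSM are often clipped to a fixed range (for example [0, 1]) before being used as network targets; that choice is left out here.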

Evaluation

| Link | Language | Description |
| --- | --- | --- |
| PESQ etc. | Matlab | Evaluation for PESQ, CSIG, CBAK, COVL, STOI. |
| SNR, LSD | Python | Evaluation for signal-to-noise ratio and log-spectral distortion. |
| SDR | Matlab | Evaluation for signal-to-distortion ratio. |
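
As a rough illustration of the two Python metrics listed above, the sketch below computes a global SNR and a log-spectral distortion (LSD) with NumPy only. The function names, frame settings, and the choice of a dB-scaled log-power spectrum for LSD are assumptions; published LSD definitions differ in log base and scaling, so values are not directly comparable to the linked script.

```python
import numpy as np

def snr_db(clean, estimate, eps=1e-12):
    """Global SNR in dB, treating (clean - estimate) as the residual noise."""
    residual = clean - estimate
    return 10 * np.log10((np.sum(clean ** 2) + eps) / (np.sum(residual ** 2) + eps))

def log_spectral_distortion(clean, estimate, frame_len=512, hop=256, eps=1e-10):
    """Mean over frames of the RMS difference between log-power spectra (in dB)."""
    window = np.hanning(frame_len)

    def log_power(x):
        n_frames = 1 + (len(x) - frame_len) // hop
        frames = np.stack([x[i * hop:i * hop + frame_len] * window
                           for i in range(n_frames)])
        return 10 * np.log10(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + eps)

    diff = log_power(clean) - log_power(estimate)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))
```

Both functions expect time-aligned signals of equal length; higher SNR and lower LSD indicate an estimate closer to the clean reference.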

Audio feature extraction

| Link | Language | Description |
| --- | --- | --- |
| LPS | Python | Extract log-power-spectrum/magnitude spectrum/log-magnitude spectrum/Cepstral mean and variance normalization. |
| MFCC | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
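
The log-power-spectrum (LPS) features and mean/variance normalization mentioned in the table can be sketched as below. This is an illustrative NumPy-only version under assumed settings (512-sample Hann frames, 50% overlap, per-utterance statistics); the function names are hypothetical and do not come from the linked tools.

```python
import numpy as np

def log_power_spectrum(x, frame_len=512, hop=256, eps=1e-10):
    """Return log-power-spectrum features of shape (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    return np.log(np.abs(spectrum) ** 2 + eps)

def mean_variance_normalize(features, eps=1e-8):
    """Per-utterance normalization: zero mean, unit variance in each feature bin."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)
```

Training pipelines often compute the normalization statistics over the whole training set instead of per utterance; which variant fits depends on the model and data.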

Audio data augmentation

| Link | Language | Description |
| --- | --- | --- |
| Data simulation | Python | Add reverberation or noise, or mix speakers. |
| RIR simulation | Python | Generation of the room impulse response (RIR) using the image method. |
| pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. |
| gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration. |
| rir_simulator_python | Python | Room impulse response simulator using Python. |
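
Two of the augmentation steps listed above, adding noise at a target SNR and adding reverberation with a given RIR, can be sketched in a few lines. The function names and the tile-and-crop handling of short noise clips are assumptions for illustration; the RIR itself would come from a simulator such as those in the table or from a measured response.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, eps=1e-12):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    # Repeat and crop the noise so it matches the clean signal's length
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    clean_pow = np.mean(clean ** 2)
    noise_pow = np.mean(noise ** 2) + eps
    scale = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + scale * noise

def add_reverb(signal, rir):
    """Convolve the signal with a room impulse response, keeping the input length."""
    return np.convolve(signal, rir)[:len(signal)]
```

A typical simulation applies `add_reverb` to the clean speech first and then calls `mix_at_snr` with an SNR drawn at random from a chosen range.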

Datasets

Speech enhancement datasets (sorted by usage frequency in papers)

| Name | Utterances | Speakers | Language | Pricing | Additional information |
| --- | --- | --- | --- | --- | --- |
| Dataset by University of Edinburgh (2016) | 35K+ | 86 | English | Free | Noisy speech database for training speech enhancement algorithms and TTS models. |
| TIMIT (1993) | 6K+ | 630 | English | $250.00 | The TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
| VCTK (2009) | 43K+ | 109 | English | Free | Most were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. |
| WSJ0 (1993) | -- | 149 | English | $1500 | The WSJ database was generated from a machine-readable corpus of Wall Street Journal news text. |
| LibriSpeech (2015) | 292K | 2K+ | English | Free | Large-scale (1000 hours) corpus of read English speech. |
| CHiME series (~2020) | -- | -- | English | Free | Published by the CHiME Speech Separation and Recognition Challenge. |

Augmentation noise sources (sorted by usage frequency in papers)

| Name | Noise types | Pricing | Additional information |
| --- | --- | --- | --- |
| DEMAND (2013) | 18 | Free | Diverse Environments Multichannel Acoustic Noise Database provides a set of recordings that allow testing of algorithms using real-world noise in a variety of settings. |
| 115 Noise (2015) | 115 | Free | A noise bank for simulating noisy data from clean speech. Noises N1-N100 were collected by Guoning Hu; the other 15 home-made noise types are from USTC. |
| NoiseX-92 (1996) | 15 | Free | Database of recordings of various noises, available on 2 CD-ROMs. |

SOTA results

SOTA results on the dataset by the University of Edinburgh. All of the following methods are trained on "trainset_28spk" and evaluated on the common test set. ("F" denotes frequency domain and "T" denotes time domain.)

| Methods | Publication | Domain | PESQ | CSIG | CBAK | COVL | SegSNR | STOI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Noisy | -- | -- | 1.97 | 3.35 | 2.44 | 2.63 | 1.68 | 0.91 |
| Wiener | -- | -- | 2.22 | 3.23 | 2.68 | 2.67 | 5.07 | -- |
| SEGAN | INTERSPEECH 2017 | T | 2.16 | 3.48 | 2.94 | 2.80 | 7.73 | 0.93 |
| CNN-GAN | APSIPA 2018 | F | 2.34 | 3.55 | 2.95 | 2.92 | -- | 0.93 |
| WaveUnet | arXiv 2018 | T | 2.40 | 3.52 | 3.24 | 2.96 | 9.97 | -- |
| WaveNet | ICASSP 2018 | T | -- | 3.62 | 3.24 | 2.98 | -- | -- |
| U-net | ISMIR 2017 | F | 2.48 | 3.65 | 3.21 | 3.05 | 9.34 | -- |
| MSE-GAN | ICASSP 2018 | F | 2.53 | 3.80 | 3.12 | 3.14 | -- | 0.93 |
| DFL | INTERSPEECH 2019 | T | -- | 3.86 | 3.33 | 3.22 | -- | -- |
| DFL reimplemented | ICLR 2019 | T | 2.51 | 3.79 | 3.27 | 3.14 | 9.86 | -- |
| TasNet | TASLP 2019 | T | 2.57 | 3.80 | 3.29 | 3.18 | 9.65 | -- |
| MDPhD | arXiv 2018 | T&F | 2.70 | 3.85 | 3.39 | 3.27 | 10.22 | -- |
| Complex U-net | INTERSPEECH 2019 | F | 3.24 | 4.34 | 4.10 | 3.81 | 16.85 | -- |
| Complex U-net reimplemented | arXiv 2019 | F | 2.87 | 4.12 | 3.47 | 3.51 | 9.96 | -- |
| SDR-PESQ | arXiv 2019 | F | 3.01 | 4.09 | 3.54 | 3.55 | 10.44 | -- |
| RHRnet | ICASSP 2020 | T | 3.20 | 4.37 | 4.02 | 3.82 | 14.71 | 0.98 |

Learning materials

Book or thesis

  • A Study on WaveNet, GANs and General CNN-RNN Architectures, 2019 [link]
  • Deep learning: methods and applications, 2016 [link]
  • Deep learning by Ian Goodfellow and Yoshua Bengio and Aaron Courville, 2016 [link]
  • Robust automatic speech recognition by Jinyu Li and Li Deng, 2015 [link]

Video

  • CCF speech seminar 2020 [link]
  • Real-time Single-channel Speech Enhancement with Recurrent Neural Networks by Microsoft Research, 2019 [link]
  • Deep learning in speech by Hongyi Li, 2019 [link]
  • High-Accuracy Neural-Network Models for Speech Enhancement, 2017 [link]
  • DNN-Based Online Speech Enhancement Using Multitask Learning and Suppression Rule Estimation, 2015 [link]
  • Microphone array signal processing: beyond the beamformer, 2011 [link]

Slides

  • Deep learning in speech by Hongyi Li, 2019 [link]
  • Learning-based approach to speech enhancement and separation (INTERSPEECH tutorial, 2016) [link]
  • Deep learning for speech/language processing (INTERSPEECH tutorial by Li Deng, 2015) [link]
  • Speech enhancement algorithms (Stanford University, 2013) [link]