Speech Enhancement in TTS systems

This is a Winter of Code research project aimed at enhancing speech generated by text-to-speech (TTS) models.

The speech generated by many TTS models contains some ambient noise and noise-like artifacts. We want to work on post-processing that reduces or removes those artifacts. Along with removing the noise, we also wish to quantify the noise and how much clearer the audio becomes after we apply our method, so we are also interested in developing metrics for quantifying speech clarity.

Datasets

We use the following datasets to test our methods:

NOIZEUS dataset
Skit TTS dataset

Speech Enhancement Methods

Speech enhancement methods can be broadly classified into two categories:

Signal Processing

These methods use traditional analytical filters that remove noise either by assuming additive noise or by assuming an orthogonal direct-sum decomposition of the noisy signal into clean-speech and pure-noise components. Statistical estimation techniques such as MMSE, MAP and MLE fall into this category as well.

We implemented and tested the following methods:

Kalman Filter
Wiener Filter
Oversubtraction / Spectral Flooring
Bayesian MMSE Filter
Bayesian MMSE Log Filter
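
As an illustration of the additive-noise assumption and of oversubtraction with spectral flooring, here is a minimal NumPy sketch. It is not the implementation in this repository. The function name `spectral_subtraction`, the parameters `alpha` (oversubtraction factor) and `beta` (spectral floor), their default values, and the assumption that the first few frames contain noise only are all illustrative choices.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=512, noise_frames=6, alpha=2.0, beta=0.02):
    """Power spectral subtraction with oversubtraction and spectral flooring (sketch)."""
    noisy = np.asarray(noisy, dtype=np.float64)
    hop = frame_len // 2  # 50% overlap
    # Square-root periodic Hann, applied at analysis and synthesis,
    # so that overlapping windows sum to one in the interior of the signal.
    window = np.sqrt(np.hanning(frame_len + 1)[:-1])

    # Slice the signal into overlapping, windowed frames.
    n_frames = 1 + (len(noisy) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    spec = np.fft.rfft(noisy[idx] * window, axis=1)
    power = np.abs(spec) ** 2
    phase = np.angle(spec)

    # Assumption: the first `noise_frames` frames contain noise only.
    noise_power = power[:noise_frames].mean(axis=0)

    # Oversubtract the noise estimate, then floor the result so that
    # no bin drops below a small fraction of the noise power.
    clean_power = np.maximum(power - alpha * noise_power, beta * noise_power)

    # Recombine with the noisy phase and go back to the time domain.
    clean_spec = np.sqrt(clean_power) * np.exp(1j * phase)
    clean_frames = np.fft.irfft(clean_spec, n=frame_len, axis=1) * window

    # Overlap-add. The first and last half frame stay tapered, and trailing
    # samples that do not fill a whole frame are left at zero.
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += clean_frames[i]
    return out
```

A larger `alpha` removes more noise at the cost of more speech distortion, while `beta` keeps a small noise floor that masks the isolated spectral peaks often heard as "musical noise".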

Deep Learning

These methods are more recent and are learned from data rather than derived analytically. Some popular examples are the Facebook Denoiser, SEGAN and RNNoise.

Metrics

We implement and test the following metrics:

Perceptual Evaluation of Speech Quality (PESQ, narrow and wide band)
Short-Time Objective Intelligibility (STOI)
F0 Frame Error (FFE)
Gross Pitch Error (GPE)
Mel Cepstral Distortion (MCD, both versions)
Voicing Decision Error (VDE)
Mean Speech Distortion (MSD)
Word Error Rate (WER)
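
To make the pitch-based metrics concrete, the sketch below computes gross pitch error, the voicing error rate and F0 frame error from two frame-aligned F0 tracks. It follows the definitions commonly used in the literature and is not necessarily how this repository implements them. The function name `pitch_metrics`, the convention that an F0 of 0 marks an unvoiced frame, and the 20% tolerance are assumptions.

```python
import numpy as np

def pitch_metrics(f0_ref, f0_est, tol=0.2):
    """Return (GPE, voicing error rate, FFE) for two frame-aligned F0 tracks in Hz."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)

    # Assumption: a value of 0 Hz marks an unvoiced frame.
    voiced_ref = f0_ref > 0
    voiced_est = f0_est > 0
    both_voiced = voiced_ref & voiced_est

    # Gross pitch error: both tracks are voiced, but the estimate deviates
    # from the reference by more than `tol` (20% by default).
    gross = both_voiced & (np.abs(f0_est - f0_ref) > tol * f0_ref)

    # Voicing error: the two tracks disagree on voiced vs. unvoiced.
    voicing_err = voiced_ref != voiced_est

    n_frames = len(f0_ref)
    gpe = gross.sum() / max(both_voiced.sum(), 1)
    voicing_error_rate = voicing_err.sum() / n_frames
    # F0 frame error: frames with either kind of error, over all frames.
    ffe = (gross.sum() + voicing_err.sum()) / n_frames
    return gpe, voicing_error_rate, ffe
```

The F0 tracks themselves would come from a pitch tracker run on the reference and the enhanced audio with the same frame settings. Lower values are better for all three metrics.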

The signal processing filters listed above work well on the NOIZEUS dataset, but they perform noticeably worse on speech generated by TTS models. The results of applying the filters to both the NOIZEUS and the TTS datasets are discussed in detail here. Hence we need to resort to deep learning methods.

This repo can be used for real-world speech denoising. Most importantly, it provides implementations of key metrics that can be used to measure the distortion and clarity of speech.

Installation

To install, clone the repository and install the requirements:

git clone https://github.com/skit-ai/woc-tts-enhancement
cd woc-tts-enhancement
pip install -r requirements.txt