/speech-enhancement

An attempt at Vietnamese speech enhancement with U-Net and a ResNet-based U-Net


Vietnamese Speech-enhancement with Wave U-net

Visualize in W&B

Open in Streamlit


Abstract

Speech enhancement is the task of improving the intelligibility and quality of a speech signal that has been corrupted by noise or distortion, which degrades the signal and compromises its effectiveness in communication.

The GIF below illustrates the concept of speech enhancement, where the input audio is processed to suppress the parts of the signal corresponding to noise.

Time series denoising

Source: Vincent Belz [1]

This repository is our assignment for the course Speech Processing (INT3411 20), where we attempt to use U-Net [2] for the speech enhancement task and deploy a simple web application.

Dataset

Audio can be represented in many different ways, from raw time series to time-frequency decompositions. We represent it as spectrograms: 2D images built from sequences of Short-Time Fourier Transforms (STFT), with time and frequency as axes and brightness representing the strength of each frequency component at each time frame. The model's input is the noisy voice spectrogram and the ground truth is the clean voice spectrogram (a rough code sketch follows the figure below).

Sound to spectrogram

Source: Vincent Belz [1]
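As a rough illustration of this preprocessing step, a spectrogram can be computed from a waveform with librosa (a minimal sketch; the exact window and hop sizes used in this repository may differ):

import librosa
import numpy as np

# Load an audio file at 8 kHz, the sampling rate used in this project.
y, sr = librosa.load("example.wav", sr=8000)

# Short-Time Fourier Transform: complex matrix of shape (freq_bins, time_frames).
# The n_fft and hop_length values here are illustrative only.
stft = librosa.stft(y, n_fft=256, hop_length=64)

# The magnitude (here converted to dB) is the 2D "image" the model sees.
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# The phase is kept aside so a denoised magnitude can be inverted back to audio.
phase = np.angle(stft)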

The clean voices are approximately 10 hours of Vietnamese articles read by us, students of the Speech Processing course at UET.

The environmental noises were gathered from the ESC-50 dataset [3]. However, we focus on only 20 classes that we believe are the most relevant to everyday environmental noise. These classes are:

vacuum cleaner, engine, keyboard typing, fireworks, mouse click, footsteps, clapping, clock alarm, car horn, door wood knock, wind, drinking sipping, washing machine, rain, rooster, snoring, breathing, toilet flush, clock tick, laughing

We used the public code by Vincent Belz [1] to transform the datasets from audio to spectrograms. Audio was sampled at 8 kHz and we extracted windows slightly longer than 1 second. Noise was blended into the clean voices at a randomized noise level (between 20% and 80%).
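A minimal sketch of how a noise clip can be blended into a clean voice clip at a random level (illustrative only; the actual mixing code comes from the data preparation scripts adapted from [1]):

import numpy as np

def blend_noise(clean, noise, low=0.2, high=0.8):
    """Mix a noise clip into a clean voice clip at a random level between low and high."""
    # Tile or crop the noise so it matches the length of the clean clip.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    level = np.random.uniform(low, high)  # random noise level, e.g. between 20% and 80%
    noisy = clean + level * noise
    return noisy, level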

We publish our dataset as an Artifact in this project's workspace on Weights & Biases (W&B). We highly recommend taking a look at what we've done on W&B:

Visualize in W&B
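The published artifact can also be pulled programmatically with the W&B API (a sketch; the entity, project and artifact names below are placeholders, check the workspace for the real ones):

import wandb

# Download the dataset artifact from the project workspace.
# "your-entity/speech-enhancement/dataset:latest" is a placeholder name.
api = wandb.Api()
artifact = api.artifact("your-entity/speech-enhancement/dataset:latest")
data_dir = artifact.download()  # local directory containing the dataset files
print("Dataset downloaded to", data_dir)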

Training

To take the mel spectrogram image of the audio as input and output the mel spectrogram image of the detected noise, we focus on U-Net, a model initially developed for biomedical image segmentation and later adapted to audio source separation [2], and adapt it to denoise spectrograms (a rough code sketch follows the figure below).

Unet for denoise spectrograms

Source: Vincent Belz [1]
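For orientation, a single-level encoder-decoder in the spirit of U-Net could look like the sketch below (assuming PyTorch; the real model in this repository may differ in framework, depth and channel counts):

import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions with BatchNorm and ReLU: the basic U-Net building block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class TinyUNet(nn.Module):
    """One-level U-Net: downsample, bottleneck, upsample, with a skip concatenation."""
    def __init__(self, start_fm=32):
        super().__init__()
        self.enc = DoubleConv(1, start_fm)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = DoubleConv(start_fm, start_fm * 2)
        self.up = nn.ConvTranspose2d(start_fm * 2, start_fm, kernel_size=2, stride=2)
        self.dec = DoubleConv(start_fm * 2, start_fm)
        self.out = nn.Conv2d(start_fm, 1, kernel_size=1)

    def forward(self, x):
        e = self.enc(x)                          # encoder features
        b = self.bottleneck(self.pool(e))        # bottleneck at half resolution
        d = self.up(b)                           # back to the input resolution
        d = self.dec(torch.cat([d, e], dim=1))   # skip connection by concatenation
        return self.out(d)                       # predicted noise spectrogram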

The original U-Net contains only convolutional and deconvolutional (transposed convolution) layers. In this project, inspired by the architecture of the residual block [5], we attempt to introduce new skip connections into the U-Net architecture. The architecture of the ResNet-based U-Net is shown below:

Unet based Resnet for denoising spectrograms

However, experiments showed some downsides of this architecture compared to the original. We are still researching this ResNet-based U-Net model and the work is not finished yet; use it at your own risk.
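A rough sketch of the kind of residual skip connection we experimented with inside a convolutional block (again assuming PyTorch and heavily simplified; the actual implementation may differ):

import torch.nn as nn

class ResidualDoubleConv(nn.Module):
    """DoubleConv variant with an identity shortcut, in the spirit of a residual block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the shortcut matches the output channel count.
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x) + self.shortcut(x))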

🚀 To recreate our experiments, train a model with a specific architecture, learning rate, dropout rate, etc. by running:

$ python train.py [--args config]

optional arguments:
  -h, --help            show this help message and exit
  --run RUN             run name
  --model MODEL         initial weights path
  --dropout DROPOUT     dropout rate
  --epoch EPOCH         number of epochs
  --startfm STARTFM     architecture start frame
  --batchsize BATCHSIZE
                        total batch size for all GPUs
  --lr LR               learning rate (default: 0.0001)
  --tuning              disable plotting images during tuning
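For example, a hypothetical run might look like this (the flag values are only illustrative, not the exact settings we used):

$ python train.py --run unet-baseline --epoch 100 --startfm 32 --batchsize 8 --lr 0.0001 --dropout 0.3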

Experiment tracking & hyperparameter tuning

For experiment tracking, we used an awesome tool named Weights & Biases (W&B). W&B gives us a stunning dashboard for logging data while training and tracking experiments in real time ☕ (see the sketch after the screenshot below).

W&B tracking experiment
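The logging itself takes only a couple of calls (a minimal sketch of the W&B API; the project name and metric names are placeholders):

import random
import wandb

# Start a run in the project workspace (project and run names are placeholders).
wandb.init(project="speech-enhancement", name="unet-baseline")

for epoch in range(10):
    # Placeholder metrics standing in for the real training loop.
    train_loss = random.random()
    valid_iou = random.random()
    # Log the metrics for this epoch; they appear live on the dashboard.
    wandb.log({"train/loss": train_loss, "valid/iou": valid_iou, "epoch": epoch})

wandb.finish()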

Once again, we highly encourage you to take a look at our project workspace to see more about our training process:

Visualize in W&B

Weights & Biases also provides a tool named Sweeps for hyperparameter tuning. It launches a series of runs and searches for better hyperparameters to maximize or minimize a configured goal. We initialized a sweep for each of the two models (U-Net and U-Net Res) to maximize their validation IoU (a configuration sketch follows the figures below).

Tuning hyperparameter

Hyperparameter tuning for U-net Res
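As an illustration, a sweep can be configured and launched from Python roughly like this (a sketch; the search space and metric name are placeholders, apart from the goal of maximizing validation IoU):

import wandb

# Sweep configuration: random search over a few hyperparameters,
# maximizing the validation IoU logged by train.py (names are placeholders).
sweep_config = {
    "method": "random",
    "metric": {"name": "valid/iou", "goal": "maximize"},
    "parameters": {
        "lr": {"distribution": "uniform", "min": 1e-5, "max": 1e-3},
        "dropout": {"values": [0.0, 0.1, 0.3]},
        "startfm": {"values": [16, 32]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="speech-enhancement")
# `train` would be a function that calls wandb.init() and runs one training job:
# wandb.agent(sweep_id, function=train, count=20)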

Results

While training, your W&B dashboard may log images like the one shown. These are mel spectrograms of the data going into and coming out of your model. From left to right: the first is the spectrogram of the noisy audio, the second is the ground-truth noise spectrogram, and the third is our model's prediction.

Logged spectrograms: noisy input, noise ground truth, model prediction

With the hyperparameters from the tuning process, we trained two models with the configurations shown:

Model         Start frame   Batch size   Learning rate   Dropout rate   Params (M)   IoU (train)   IoU (val)
Unet          32            8            0.0000826       -              7.76         51.14         46.14
Unet Resnet   16            8            0.0001416       0.3            7.08         41.0          44.14

Deployment

To build a simple web application for demonstration, we use Streamlit, an amazing tool for people who don't know much about HTML, CSS and so on. With Streamlit, we can write our back end in Python, which is very cool and easy to get started with.
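At its core, such an app needs only a few Streamlit calls (a simplified sketch, not the full app.py; the denoising step is a placeholder):

import streamlit as st

st.title("Vietnamese Speech Enhancement")

# Let the user upload a noisy recording.
uploaded = st.file_uploader("Upload a noisy audio file", type=["wav"])

if uploaded is not None:
    st.audio(uploaded, format="audio/wav")
    if st.button("Denoise"):
        # Placeholder: here the trained U-Net would be loaded and run on the
        # uploaded audio, and the cleaned audio played back, e.g.
        # st.audio(denoise(uploaded.read()), format="audio/wav")
        st.write("Denoised audio would be played here.")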

Streamlit share

We have deployed our application to Streamlit Share, which provides 1 CPU with 800 MB of RAM and 800 MB of dedicated storage for our application.

Open in Streamlit

Note: if you see that the app is sleeping, that is due to a Streamlit Share limitation; just wake it up. Many thanks!

Localhost run

To run our application on your own machine or a virtual machine, install Streamlit and the other dependencies by running:

$ pip install -r requirements.txt

To launch the Streamlit app, run:

$ streamlit run app.py

After that, if you see the same thing as in this video, you are good to go: video-demo

Team members

Dung Nguyen Manh:

Nguyen Phuc Hai:

References

[1] Vincent Belz, "Speech-enhancement". GitHub: https://github.com/vbelz/Speech-enhancement.

[2] Jansson, Andreas, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. Singing Voice Separation with Deep U-Net Convolutional Networks. ISMIR (2017). https://ejhumphrey.com/assets/pdf/jansson2017singing.pdf

[3] Karol J. Piczak. 2015. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd ACM international conference on Multimedia (MM '15). Association for Computing Machinery, New York, NY, USA, 1015–1018. DOI:https://doi.org/10.1145/2733373.2806390

[4] Grais, E. M., & Plumbley, M. D. (2017, November). Single channel audio source separation using convolutional denoising autoencoders. In 2017 IEEE global conference on signal and information processing (GlobalSIP) (pp. 1265-1269). IEEE.

[5] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2015. arXiv: 1409.1556 [cs.CV].