/SpeechDenoisingDNN

Removing various types of noise from speech using Deep Neural Networks

Speech Enhancement using Deep Neural Networks

Introduction

When working with real-time speech signals, we need to account for the various types of noise that get added to the original signal and corrupt it. To make better sense of such signals, it is necessary to enhance the speech by removing the noise present in it.

Applications:

  • Automatic speech recognition
  • Speaker recognition
  • Mobile communication
  • Hearing aids

DNN based architectures

  • Autoencoder (encoder-decoder)
  • Recurrent Neural Nets
  • Restricted Boltzmann Machines

Dataset:

The dataset used for this project is the TCD-TIMIT speech corpus, a new database and baseline for noise-robust audio-visual speech recognition.

Description

  • Number of speakers: 62 (high-quality audio samples)
  • Total number of sentences: 6913 phonetically rich sentences
  • Each audio sample is sampled at 16,000 Hz
  • Three of the speakers are professionally trained lipspeakers
  • 6 types of noise at SNRs ranging from -5 dB to 20 dB: Babble, Cafe, Car, Living Room, White, Street
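To illustrate what an SNR level such as -5 dB or 20 dB means, here is a minimal sketch of mixing a noise recording into a clean signal at a chosen SNR. The dataset already provides the noisy audio, so this is not part of the project's pipeline; the function name `mix_at_snr` and the synthetic signals are assumptions made for the example.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested SNR (in dB)."""
    clean = np.asarray(clean, dtype=np.float64)
    # Assumes the noise recording is at least as long as the clean signal.
    noise = np.asarray(noise, dtype=np.float64)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(clean_power / scaled_noise_power)
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise

# Synthetic stand-ins for a 1-second clean utterance and a noise clip at 16 kHz.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=-5)
```

At -5 dB the noise carries roughly three times the power of the speech, while at 20 dB the speech carries a hundred times the power of the noise.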

Downloadable link for the dataset:

You can find the complete dataset here: https://zenodo.org/record/260228

Approach followed:

  • Used log power spectrum of the signal as features
  • Computed STFT of the signal with nfft=256, noverlap=128, nperseg=256
  • STFT = log(abs(STFT))
  • Trained an autoencoder (encoder-decoder) type network on inputs spanning 16 frames (autoencoder image)
  • Used Mean Square Error loss
  • Adam optimizer (default parameters)
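The feature extraction steps above can be sketched roughly as follows. This is a minimal sketch, assuming `scipy.signal.stft` is the STFT routine (the parameter names in the list match its signature). The function name `log_spectrum`, the small `eps` added before the logarithm, and the random test signal are assumptions made for the example.

```python
import numpy as np
from scipy.signal import stft

def log_spectrum(signal, fs=16000, eps=1e-10):
    """Return log(abs(STFT)) with shape (n_freq_bins, n_frames)."""
    # Parameters taken from the list above: nfft=256, noverlap=128, nperseg=256.
    _, _, Z = stft(signal, fs=fs, nperseg=256, noverlap=128, nfft=256)
    # eps is an assumption: it keeps the log finite when a bin is exactly zero.
    return np.log(np.abs(Z) + eps)

# One second of random samples stands in for a 16 kHz utterance.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
features = log_spectrum(signal)
print(features.shape)  # first dimension is nfft // 2 + 1 = 129 frequency bins
```

The log power spectrum is `log(abs(STFT)**2)`, which equals twice the value computed here, so the two differ only by a constant factor.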

Frameworks:

  • Keras backend
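As a rough illustration of the training setup named in the approach (Mean Square Error loss, Adam with default parameters), here is a hedged Keras sketch. The layer sizes, activations, the `build_autoencoder` name, and the use of `tensorflow.keras` are all assumptions; the actual architectures are given only in the figures of this README.

```python
import numpy as np
from tensorflow import keras

N_BINS = 129  # nfft // 2 + 1 frequency bins per frame for nfft=256

def build_autoencoder(input_dim=N_BINS, output_dim=N_BINS, hidden=(512, 128, 512)):
    """Dense encoder-decoder; the hidden sizes are hypothetical."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for units in hidden:
        model.add(keras.layers.Dense(units, activation="relu"))
    # Linear output because log-spectral targets are unbounded real values.
    model.add(keras.layers.Dense(output_dim, activation="linear"))
    # Mean squared error loss and Adam with its default parameters.
    model.compile(optimizer=keras.optimizers.Adam(), loss="mse")
    return model

model = build_autoencoder()

# Random arrays stand in for noisy (x) and clean (y) log-spectral frames.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, N_BINS)).astype("float32")
y = rng.standard_normal((32, N_BINS)).astype("float32")
history = model.fit(x, y, epochs=1, batch_size=8, verbose=0)
```

For a model that also sees past frames, `input_dim` would grow to the number of stacked frames times `N_BINS`, while `output_dim` stays at one frame.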

Network Overview

network overview image

Methods Implemented

  1. Frame-to-frame training (the input is a noisy frame matrix, the output is the corresponding clean matrix)
  2. Context-based training, using the heuristic that the noise in the present frame depends on both the present frame and the past few frames. Based on this, a model was trained on the past 7 frames together with the present frame

1. Frame to frame:

  • Trained the network with a noisy frame as the input and the corresponding clean frame as the output.
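A minimal sketch of how frame-to-frame training pairs might be assembled from a noisy and a clean log spectrogram. The function name `make_frame_pairs`, the `(n_bins, n_frames)` layout, and the truncation to a common frame count are assumptions made for the example.

```python
import numpy as np

def make_frame_pairs(noisy_spec, clean_spec):
    """Turn two (n_bins, n_frames) spectrograms into (n_frames, n_bins) pairs."""
    # Assumption: if the frame counts differ, keep only the frames both share.
    n_frames = min(noisy_spec.shape[1], clean_spec.shape[1])
    X = noisy_spec[:, :n_frames].T  # one noisy frame per row (network input)
    Y = clean_spec[:, :n_frames].T  # matching clean frame per row (target)
    return X, Y

# Random arrays stand in for log spectrograms with 129 bins.
rng = np.random.default_rng(0)
noisy_spec = rng.standard_normal((129, 126))
clean_spec = rng.standard_normal((129, 125))
X, Y = make_frame_pairs(noisy_spec, clean_spec)
```

Row `i` of `X` and row `i` of `Y` then describe the same time frame, which is what a frame-to-frame regression needs.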

Architecture

model 1 architecture

Model

Model 1 image

Results

  • The following waveforms show the results obtained with the network above.

Clean Signal

Clean image

Corrupted Signal

Corrupted image

Enhanced Signal

Enhanced image

2. Network based on the past frames:

  • Trained the network with the past 7 noisy frames concatenated with the present noisy frame as the input and the clean version of the present frame as the output.
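The concatenation of past frames can be sketched as below. This is only an illustration: the function name `stack_context`, the `(n_frames, n_bins)` layout, and the choice to pad the start of the utterance by repeating its first frame are assumptions, since the README does not say how the first frames are handled.

```python
import numpy as np

def stack_context(frames, n_past=7):
    """Map (n_frames, n_bins) to (n_frames, (n_past + 1) * n_bins).

    Row t holds frames t-n_past, ..., t-1, t side by side (present frame last).
    """
    frames = np.asarray(frames)
    n_frames = len(frames)
    # Assumption: repeat the first frame so early frames have a full context.
    padding = np.repeat(frames[:1], n_past, axis=0)
    padded = np.concatenate([padding, frames], axis=0)
    # Shifted copies of the sequence, oldest context first.
    windows = [padded[i : i + n_frames] for i in range(n_past + 1)]
    return np.concatenate(windows, axis=1)

# Tiny example: 10 frames with 3 bins each.
frames = np.arange(30).reshape(10, 3)
stacked = stack_context(frames, n_past=7)
print(stacked.shape)  # (10, 24)
```

With 129 frequency bins per frame, each input row would contain 8 x 129 = 1032 values, and the training target stays the single clean frame at time `t`.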

Architecture

model 2 architecture

Model

Model 2 image

Results

  • The following waveforms show the results obtained with the network above.

Clean Signal

Clean image

Corrupted Signal

Corrupted image

Enhanced Signal

Enhanced image

Note

  • This project is made available for research purposes only.
  • This project will be continued and updated over time.
  • Code is available only for the babble and cafe noise types. The code for the remaining noise types would be nearly identical.

Contact

For any further queries, contact: amballachaitanya777@gmail.com