
Unofficial Keras implementation of Google AI VoiceFilter


Voice Filter

This is a TensorFlow/Keras implementation of Google AI VoiceFilter.

Our work is inspired by the academic paper: https://arxiv.org/abs/1810.04826

The implementation is based on: https://github.com/mindslab-ai/voicefilter


Team Members

  1. Angshuman Saikia

  2. Abhinav Jain

  3. Yashwardhan Gautam


Introduction

We aim to improve the accuracy of automatic speech recognition (ASR) by separating the speech of the primary speaker from the other voices in the input audio. This is useful in chatbots, voice assistants, and video conferencing.


Who is our primary speaker?

All users of a service will have to record their voice print during enrolment. The voice print associated with the account is used to identify the primary speaker.

How is the voice print recorded?

An audio clip is processed by a separately trained deep neural network to generate a speaker-discriminative embedding. As a result, every speaker is represented by a vector of length 256.


How to prepare the dataset?

We use the publicly available LibriSpeech speech dataset. Each sample is prepared as follows:

  1. Select a primary and a secondary speaker at random.

  2. For the primary speaker, select a random speech clip as the reference and a random speech clip as the input.

  3. Select a random speech clip of the secondary speaker.

  4. Mix the input speech clips of the primary and secondary speakers. The mixture serves as one of the model inputs.

  5. Pass the reference speech through a pre-trained model (source: https://github.com/mindslab-ai/voicefilter) to create an embedding, which serves as the other model input.

  6. Use the input speech of the primary speaker as the target output.

The speech clips are not used directly. Instead, they are converted into magnitude spectrograms before being fed into a deep neural network. We used Python's librosa library for all audio-related functions.
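The mixing and spectrogram steps described above can be sketched as follows. This is only an illustration written with plain NumPy so that it runs without audio files; the project itself uses librosa. The FFT size, the hop length, and the helper names `mix_signals` and `magnitude_spectrogram` are assumptions made for this sketch, not values taken from the repository.

```python
import numpy as np

SAMPLE_RATE = 16000  # LibriSpeech audio is sampled at 16 kHz
N_FFT = 1200         # assumed window/FFT size (hypothetical)
HOP = 160            # assumed hop length (hypothetical)


def mix_signals(primary, secondary):
    """Trim both clips to the shorter length and add them sample by sample."""
    length = min(len(primary), len(secondary))
    return primary[:length] + secondary[:length], length


def magnitude_spectrogram(wave, n_fft=N_FFT, hop=HOP):
    """Return |STFT| with shape (frames, n_fft // 2 + 1) using a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


# Synthetic stand-ins for two speech clips (real code would load audio files).
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
primary = 0.5 * np.sin(2 * np.pi * 220 * t)
secondary = 0.5 * np.sin(2 * np.pi * 440 * t[:12000])

mixed, length = mix_signals(primary, secondary)
mixed_mag = magnitude_spectrogram(mixed)               # model input
target_mag = magnitude_spectrogram(primary[:length])   # training target
```

Trimming to the shorter clip is one simple way to align the two signals; padding or cropping to a fixed duration would work equally well and gives every sample the same spectrogram shape.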

We created a dataset of 29351 training samples, divided into 8 parts for ease of use with limited RAM. Link to the Kaggle dataset: https://www.kaggle.com/abhinavjain02/speech-separation
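One way to consume a dataset split into parts without holding all of it in RAM is a Python generator that loads a single part, yields batches from it, and then moves on to the next part. The sketch below is an assumption about how this could be done, not the repository's code: `load_part` is a hypothetical loader, `fake_load_part` produces random placeholder data, and the spectrogram shapes are deliberately tiny.

```python
import numpy as np


def batch_generator(n_parts, load_part, batch_size):
    """Yield ((mixed, embedding), target) batches, one part in memory at a time.

    `load_part(i)` is a hypothetical callable returning three arrays with the
    same first dimension: mixed spectrograms, embeddings, target spectrograms.
    """
    while True:  # loop forever; the caller decides how many batches form an epoch
        for part in range(n_parts):
            mixed, emb, target = load_part(part)
            for start in range(0, len(mixed), batch_size):
                end = start + batch_size
                yield (mixed[start:end], emb[start:end]), target[start:end]


def fake_load_part(i):
    """Stand-in loader producing random data; real code would read from disk."""
    rng = np.random.default_rng(i)
    n = 5  # samples per part (placeholder)
    return (rng.random((n, 4, 3)),   # mixed spectrograms (placeholder shape)
            rng.random((n, 256)),    # speaker embeddings of length 256
            rng.random((n, 4, 3)))   # target spectrograms


gen = batch_generator(n_parts=8, load_part=fake_load_part, batch_size=2)
(first_mixed, first_emb), first_target = next(gen)
```

Because only one part is loaded at a time, peak memory is bounded by the size of the largest part rather than by the whole dataset.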


Stats on Prepared Data

It took around 11 hours to prepare the dataset on Google Colab. The code is present in the dataset folder.

Note: All ordered pairs of primary and secondary speakers are unique.

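| Stat / Dataset                                                 | Train        | Dev         | Test        |
|----------------------------------------------------------------|--------------|-------------|-------------|
| Total no. of unique speeches available in LibriSpeech Dataset  | 28539        | 2703        | 2620        |
| No. of unique speeches used                                    | 26869        | 1878        | 1838        |
| Percentage of total speeches used                              | 94.15 %      | 69.48 %     | 70.15 %     |
| Total no. of samples prepared                                  | 29351        | 934         | 964         |
| No. of samples with same primary and reference speech          | 376 (1.28 %) | 10 (1.07 %) | 11 (1.14 %) |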

Proposed System Architecture


Requirements

  • This code was tested on Python 3.6.9 with Google Colab.

    Other packages can be installed by:

    pip install -r requirements.txt
    

Model


The model architecture follows the academic paper mentioned above. The model takes the input spectrogram and a d-vector (speaker embedding) as inputs and produces a soft mask. Multiplying the input spectrogram element-wise by this mask produces the output spectrogram. The output spectrogram is then combined with the phase of the input to reconstruct the primary speaker's audio from the mixed input speech.
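The masking and phase-recombination step can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions: the helper name `apply_soft_mask` is hypothetical, the mask is random rather than produced by the network, and the spectrogram is a tiny toy array.

```python
import numpy as np


def apply_soft_mask(mixed_stft, mask):
    """Scale the mixed magnitude by the mask and reuse the mixed phase."""
    magnitude = np.abs(mixed_stft)
    phase = np.angle(mixed_stft)
    est_magnitude = mask * magnitude            # element-wise masking
    return est_magnitude * np.exp(1j * phase)   # complex STFT estimate


# Toy complex spectrogram (frames x frequency bins) and a mask in [0, 1).
rng = np.random.default_rng(0)
mixed_stft = rng.normal(size=(4, 3)) + 1j * rng.normal(size=(4, 3))
mask = rng.random((4, 3))  # stands in for the network's output

estimate = apply_soft_mask(mixed_stft, mask)
```

An inverse STFT (for example `librosa.istft`) would then convert the complex estimate back into a waveform.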

| Loss Function            | Optimizer | Metrics                          |
|--------------------------|-----------|----------------------------------|
| Mean Squared Error (MSE) | Adam      | Source-to-Distortion Ratio (SDR) |
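To make the two quantities concrete, the sketch below computes MSE and a simplified SDR on a toy signal. The SDR here is the plain energy ratio between the target and the estimation error, in dB. This is an assumption chosen for illustration: the full BSS Eval definition of SDR is more involved, and the repository may compute the metric differently. The function names are hypothetical.

```python
import numpy as np


def mse(target, estimate):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((target - estimate) ** 2))


def simple_sdr(target, estimate, eps=1e-10):
    """10 * log10(signal energy / error energy), in dB (simplified SDR)."""
    signal_energy = np.sum(target ** 2)
    error_energy = np.sum((target - estimate) ** 2)
    return float(10 * np.log10((signal_energy + eps) / (error_energy + eps)))


target = np.array([1.0, -1.0, 1.0, -1.0])
estimate = np.array([0.9, -0.9, 0.9, -0.9])  # 10 % amplitude error

print(mse(target, estimate))         # approximately 0.01
print(simple_sdr(target, estimate))  # approximately 20 dB
```

A lower MSE and a higher SDR both indicate that the estimate is closer to the target.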


Training

  • The model was trained on Google Colab for 30 epochs.
  • Training took about 37 hours on an NVIDIA Tesla P100 GPU.

Results

  • Loss

  • Validation SDR

  • Test

Note: The following results are based on the model weights after the 29th epoch (peak SDR on validation).

| Loss   | SDR    |
|--------|--------|
| 0.0104 | 5.3250 |

Audio Samples


Key learnings:

  • Processing audio data using librosa
  • Creating flexible architectures using the Keras functional API
  • Using custom generators in Keras
  • Using custom callbacks in Keras
  • Multiprocessing in Python

App Snippet