Python scripts exploring sound separation using deep clustering

Speech Separation using Neural Networks and Tensorflow

Introduction

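These scripts experiment with speech separation using various neural network structures. The experiments feed in a dataset of sound files, each containing 2 clean voices, and attempt to build a network that separates the 2 voices. So far, the following networks have been built:

  • Feed forward network
  • RNN network
  • Bi-directional RNN network
  • Bi-directional RNN network with deep clustering loss function
  • Full deep clustering model with k-means clustering
  • Full deep clustering model with mean-shift clustering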

The scripts were created using the Spyder IDE from Anaconda. Before executing a script, set the console's working directory to the directory that contains the script.

Source

J. R. Hershey, Z. Chen, J. Le Roux and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 31-35

Data Generator

Within the DataGenerator folder are two Python scripts that create the dataset. It is assumed that a top-level folder exists called TIMIT_WAV that contains the TIMIT dataset. The top-level folder should look something like this:

Alt text

datagenerator.py

The datagenerator.py script contains a class that creates the dataset. The dataset is saved as several pickle files in a top-level folder called Data.

datagenerator2.py

The datagenerator2.py script loads the data from a given number of pickle files and feeds it into a TensorFlow session in batches.
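
The batching idea can be sketched as follows. This is a minimal illustration, not the code in datagenerator2.py: the function name `iter_batches` is hypothetical, and it assumes each pickle file holds a plain list of samples, which the README does not specify.

```python
import pickle


def iter_batches(pickle_paths, batch_size):
    """Yield lists of up to `batch_size` samples read from the given pickle files.

    Assumption: every pickle file contains a list of samples.
    """
    batch = []
    for path in pickle_paths:
        with open(path, "rb") as handle:
            samples = pickle.load(handle)
        for sample in samples:
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final batch may be smaller than batch_size
        yield batch
```

Each yielded batch could then be passed to the training step. Batches may span file boundaries, so only one pickle file needs to be in memory at a time.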

Feed Forward network

train_net.py

The feedforward folder contains a Python script called train_net.py that trains a feed forward network. The network contains 2 hidden layers of 300 neurons each and an output layer of 129 neurons (one for each frequency bin in the spectrogram). The output layer uses a sigmoid activation function. A mean squared error loss function is used against a known ideal binary mask (IBM). The following schematic represents the flow of code:

Alt text
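
A rough NumPy sketch of the forward pass and loss described above is given here for orientation. It is not the TensorFlow code in train_net.py. The layer sizes, sigmoid output and mean squared error come from the text; the tanh hidden activation, the weight initialisation and all function names are assumptions. The ideal binary mask (IBM) is taken to be 1 wherever the first speaker's magnitude exceeds the second's.

```python
import numpy as np

N_BINS = 129    # frequency bins per spectrogram frame (from the text)
N_HIDDEN = 300  # neurons per hidden layer (from the text)


def ideal_binary_mask(spec_a, spec_b):
    """Return 1.0 where source A is louder than source B, else 0.0."""
    return (np.abs(spec_a) > np.abs(spec_b)).astype(np.float32)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng):
    """Random weights for 129 -> 300 -> 300 -> 129 (initialisation is an assumption)."""
    sizes = [N_BINS, N_HIDDEN, N_HIDDEN, N_BINS]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(params, frames):
    """Map (n_frames, 129) spectrogram frames to a soft mask in (0, 1)."""
    h = frames
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)  # hidden activation is an assumption
    w, b = params[-1]
    return sigmoid(h @ w + b)


def mse(pred, target):
    """Mean squared error between the predicted mask and the known IBM."""
    return float(np.mean((pred - target) ** 2))
```

Training would minimise `mse(forward(params, frames), ibm)` over the dataset; the gradient step is omitted here.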

After 50 epochs, the network struggles to find any pattern in the data. The accuracy after 50 epochs is still close to 50%.

Alt text
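
The README does not say how accuracy is measured. One plausible reading, shown here purely as an assumption, is the fraction of time-frequency bins where the thresholded network output agrees with the known IBM. The function name `mask_accuracy` and the 0.5 threshold are hypothetical.

```python
import numpy as np


def mask_accuracy(predicted, ibm, threshold=0.5):
    """Fraction of bins where the thresholded prediction equals the IBM."""
    binary = (predicted >= threshold).astype(np.float32)
    return float(np.mean(binary == ibm))
```

Under this definition, if roughly half of the bins belong to each speaker, a mask that ignores its input also scores close to 50%.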

A test signal containing a mixture of 2 voices was fed into the network and the following IBM was produced:

Alt text

After applying the IBM, the resulting sound wave looks (and sounds) the same as the original sound wave, implying that a feed forward network is not a good model for speech separation.

Alt text
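
Applying a binary mask amounts to an element-wise multiplication with the mixture spectrogram. The sketch below assumes the mask and the spectrogram have the same shape; the function name `apply_mask` is hypothetical, and converting the result back to audio (inverse STFT) is left out.

```python
import numpy as np


def apply_mask(mixture_spec, mask):
    """Return spectrogram estimates for both speakers from one binary mask."""
    speaker_1 = mixture_spec * mask
    speaker_2 = mixture_spec * (1.0 - mask)
    return speaker_1, speaker_2
```

A mask that is all ones returns the mixture unchanged, so an uninformative mask leaves the output sounding like the input.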

RNN network

train_RNN.py

The RNN folder contains a Python script called train_rnn.py. This script trains a 2-layer RNN using LSTM cells containing 300 neurons. A final feed forward layer with 129 neurons and a sigmoid activation function produces an IBM. A mean squared error loss function was used against a known IBM. The flow is shown in the following schematic:

Alt text

The network uses the same datagenerator.py class to create the data. The spectrograms are split into chunks of 100 time frames, which are fed into the RNN. Any frames left over in a spectrogram after the last full chunk of 100 are not used for training. Like the feed forward network, the network struggles to separate the two sound sources. Accuracy on the training set after 50 epochs is still almost 50%.

Alt text
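
The chunking step can be illustrated with a short NumPy sketch. It assumes the spectrogram is stored as an array of shape (time frames, frequency bins); the function name `split_into_chunks` is hypothetical.

```python
import numpy as np


def split_into_chunks(spectrogram, chunk_len=100):
    """Split a (frames, bins) array into full chunks and drop the remainder."""
    n_chunks = spectrogram.shape[0] // chunk_len
    usable = spectrogram[: n_chunks * chunk_len]
    return usable.reshape(n_chunks, chunk_len, spectrogram.shape[1])
```

For example, a spectrogram with 250 frames and 129 bins gives an array of shape (2, 100, 129), and the last 50 frames are discarded.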

As with the feed forward network, a test signal containing a mixture of 2 voices was fed into the network and the following IBM was produced:

Alt text

As with the feed forward network, after applying the IBM, the resulting sound wave looks (and sounds) the same as the original sound wave, implying that an RNN network is not a good model for speech separation.

Alt text

Bi-directional RNN network

train_bi_directional_RNN.py

The Bi-Directional-RNN folder contains a Python script called train_bi_directional_RNN.py. This script trains a 2-layer bi-directional RNN using LSTM cells containing 300 neurons. A final feed forward layer with 129 neurons and a sigmoid activation function produces an IBM. A mean squared error loss function was used against a known IBM. The flow is shown in the following schematic:

Alt text

As with the one-directional RNN, the network uses the same datagenerator.py class to create the data. The spectrograms are split into chunks of 100 time frames, which are fed into the RNN. Any frames left over in a spectrogram after the last full chunk of 100 are not used for training. Accuracy on the training set after 50 epochs is still only 50%.

Alt text

The same test signal containing a mixture of 2 voices was fed into the network and the following IBM was produced:

Alt text

As with the other networks, after applying the IBM, the resulting sound wave looks (and sounds) the same as the original sound wave, implying that a bi-directional RNN network on its own is not a good model for speech separation.

Alt text

Bi-directional RNN network with deep clustering loss function

train_bi_with_loss_function.py

The Bi-Directional-RNN-with-loss-function folder contains a Python script called train_bi_with_loss_function.py. This script trains the same 2-layer bi-directional RNN as before. This time, the loss function from deep clustering was implemented. The flow is shown in the following schematic:

Alt text

Accuracy on the training set after 50 epochs was erratic. However, the purpose of the deep clustering loss function is not to match the IBM directly but to move the final-layer outputs (the embeddings) of different sources apart.

Alt text
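
For reference, the deep clustering objective in the cited paper is the squared Frobenius norm of VVᵀ − YYᵀ, where V holds one embedding per time-frequency bin and Y holds the one-hot speaker label of each bin. The NumPy sketch below computes it in the expanded form, which avoids building the large N×N matrices. It is an illustration of the formula, not the TensorFlow code in train_bi_with_loss_function.py, and the function name is hypothetical.

```python
import numpy as np


def deep_clustering_loss(embeddings, labels_one_hot):
    """Compute ||V V^T - Y Y^T||_F^2 without forming N x N matrices.

    embeddings:     (N, D) array, one row per time-frequency bin.
    labels_one_hot: (N, C) array, one-hot speaker assignment per bin.
    """
    v, y = embeddings, labels_one_hot
    return float(
        np.sum((v.T @ v) ** 2)
        - 2.0 * np.sum((v.T @ y) ** 2)
        + np.sum((y.T @ y) ** 2)
    )
```

The loss is zero when the embeddings reproduce the label affinities exactly (for instance when V equals Y) and grows as embeddings of different speakers align with each other.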

The same test signal containing a mixture of 2 voices was fed into the network and the following IBM was produced:

Alt text

As with the other networks, after applying the IBM, the resulting sound wave looks (and sounds) the same as the original sound wave, implying that a bi-directional RNN network on its own is not a good model for speech separation.

Alt text

Full deep clustering model with k-means clustering

train_deep_clustering.py

The full deep-clustering model is implemented in the Deep-clustering folder within the Python script called train_deep_clustering.py. The programmatic flow is shown in the following schematic:

Alt text

The bi-directional LSTM model described above produces embeddings. Test signals are fed into the trained network to obtain their embeddings. An example of the embeddings from a test signal is shown below:

Alt text

K-means clustering is then applied to the embeddings to assign each embedding to a speaker:

Alt text
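
The clustering step can be sketched with a small NumPy k-means. This is a plain Lloyd's-algorithm illustration under stated assumptions, not the implementation in train_deep_clustering.py: it assumes one embedding per time-frequency bin, stored row by row in frame-major order, and the function names `kmeans` and `labels_to_masks` are hypothetical.

```python
import numpy as np


def kmeans(points, k=2, n_iter=50, seed=0):
    """Cluster (N, D) points into k groups; return labels and centres."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing makes a copy).
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return labels, centres


def labels_to_masks(labels, n_frames, n_bins, k=2):
    """Turn per-bin cluster labels into one binary mask per speaker."""
    grid = labels.reshape(n_frames, n_bins)
    return [(grid == j).astype(np.float32) for j in range(k)]
```

Each resulting mask can be multiplied with the mixture spectrogram to estimate one speaker. Which cluster index corresponds to which speaker is arbitrary.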

The loss function is designed to move embeddings from different sources further apart and embeddings from the same source closer together:

Alt text

The same test signal containing a mixture of 2 voices as before was fed into the network. Clustering was performed on the results and the following IBM was produced:

Alt text

Below is the output of the binary mask. If enough data is fed into the network, some separation is audible (honest!):

Alt text

Full deep clustering model with mean-shift clustering

Alt text

Alt text

Alt text