This repository contains most of the code for my thesis 'Design, Implementation and Analysis of a Deep Convolutional-Recurrent Neural Network for Speech Recognition through Audiovisual Sensor Fusion' at the ESAT (Electrical Engineering) Department of KU Leuven (2016-2017).
Author: Matthijs Van keirsbilck
Supervisor: Bert Moons
Promotor: Marian Verhelst
The code and thesis text are bound by KU Leuven's Student Thesis Copyright Regulations.
The CNN-LSTM networks for lipreading are combined with LSTM networks for audio recognition through an attention mechanism.
These networks achieve state-of-the-art phoneme recognition performance on the publicly available audio-visual dataset TCD-TIMIT.
Systems that rely only on audio suffer greatly when audio quality is lowered by noise, as is often the case in real-life situations.
This performance loss can be greatly mitigated by adding visual information.
The CNN-LSTM lipreading networks achieve 68.46% correctness compared to the 57.85% baseline.
Audio-only neural networks achieve 67.03% compared to 65.47% in the baseline.
Lipreading-audio combination networks achieve 75.70% accuracy for clean audio, and 58.55% for audio with an SNR of 0 dB. The baseline multimodal network achieved 59% and 44% for clean and noisy audio, respectively.
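To make the "SNR of 0 dB" condition concrete, here is a minimal NumPy sketch of mixing noise into a clean signal at a chosen signal-to-noise ratio. It uses the generic definition SNR = 10 log10(P_signal / P_noise) and is not taken from this repository: the function names `mix_at_snr` and `measure_snr_db`, the sine "speech" signal, and the white-noise source are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not part of this repository.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if clean.shape != noise.shape:
        raise ValueError("clean and noise must have the same shape")

    power_clean = np.mean(clean ** 2)
    power_noise = np.mean(noise ** 2)
    if power_noise == 0:
        raise ValueError("noise signal has zero power")

    # SNR_dB = 10 * log10(P_clean / P_noise)  =>  P_noise = P_clean / 10^(SNR_dB / 10)
    target_power_noise = power_clean / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_power_noise / power_noise)
    return clean + scale * noise


def measure_snr_db(clean, noisy):
    """Measure the SNR in dB of `noisy`, given the clean reference signal."""
    residual = np.asarray(noisy, dtype=np.float64) - np.asarray(clean, dtype=np.float64)
    return 10.0 * np.log10(np.mean(np.square(clean)) / np.mean(residual ** 2))


# Stand-in signals: one second of a 440 Hz tone at 16 kHz plus white noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean_signal = np.sin(2 * np.pi * 440.0 * t)
white_noise = rng.standard_normal(clean_signal.shape)

noisy_0db = mix_at_snr(clean_signal, white_noise, snr_db=0.0)
noisy_10db = mix_at_snr(clean_signal, white_noise, snr_db=10.0)
```

At 0 dB the noise carries as much power as the speech, which is why audio-only recognition degrades so strongly in that condition.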
The networks are implemented using Lasagne.
There is room for improvement of the code; I'll try to improve it if I can find the time.
For downloading and preprocessing the dataset, see https://github.com/matthijsvk/TCDTIMITprocessing
For the lipreading networks, see the folder code/lipreading
For the audio speech recognition networks, see code/audioSR
For the combination networks, see code/combinedSR
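As a rough illustration of how an attention mechanism can combine an audio stream and a lipreading stream, the sketch below implements a generic soft attention over two modalities in plain NumPy. It is not the Lasagne architecture in code/combinedSR: the function `attention_fusion`, the scoring vectors `w_audio` and `w_video`, and the assumption that both streams share the same feature size are hypothetical choices made to keep the example small.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def attention_fusion(audio_feat, video_feat, w_audio, w_video):
    """Fuse two feature streams with per-time-step attention weights.

    audio_feat, video_feat: arrays of shape (T, D), one feature vector per frame.
    w_audio, w_video: scoring vectors of shape (D,) (learned in a real network,
    random or fixed here).
    Returns the fused features (T, D) and the attention weights (T, 2).
    """
    # One scalar relevance score per modality and per time step.
    scores = np.stack([audio_feat @ w_audio, video_feat @ w_video], axis=-1)
    # Normalize the two scores so the weights sum to 1 at every time step.
    weights = softmax(scores, axis=-1)
    # Weighted sum of the two streams.
    fused = weights[:, :1] * audio_feat + weights[:, 1:] * video_feat
    return fused, weights


# Toy inputs: 5 time steps, 8 features per modality.
rng = np.random.default_rng(0)
audio = rng.standard_normal((5, 8))
video = rng.standard_normal((5, 8))
fused, weights = attention_fusion(
    audio, video, rng.standard_normal(8), rng.standard_normal(8)
)
```

The idea this is meant to convey is that the weights can shift toward the visual stream when the audio is noisy; how the actual networks compute and train these weights is defined by the code in code/combinedSR, not by this sketch.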
Thanks to the authors of all the data and software used in this work. A non-exhaustive list:
To set up Python, I recommend using Anaconda. You can use the provided environment.yml to install all Python packages (although some are no longer used).
For the installation of Theano/Lasagne and CUDA, I recommend following this tutorial.
If you find this thesis or code useful, please cite it using the following BibTeX entry:
@MastersThesis{Vankeirsbilck:Thesis:2017,
author = {Matthijs Van keirsbilck},
title = {{Design, implementation and analysis of a deep convolutional-recurrent neural network for speech recognition through audiovisual sensor fusion}},
school = {KU Leuven},
address = {Belgium},
year = {2017},
}