Automatic Speech Recognition Using TensorFlow

Code is written in Python 2.7 with TensorFlow 1.0. The figure below demonstrates the high-level network structure.

Dataset

The dataset used in this repo is TIMIT. The training set contains 3699 utterances, while the test set contains 1347 utterances (the 'sa' files are removed from the original dataset to avoid biasing the system).

WAV Format Conversion

The original wav files are actually in NIST SPHERE format, so they must be converted beforehand using the script nist2wav.sh. Please make sure libsndfile is installed on your machine first.
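
If you prefer to stay in Python, below is a minimal sketch of the same conversion using the soundfile package (a libsndfile wrapper); nist_to_wav, src_dir, and dst_dir are illustrative names, not part of this repo.

```python
# A hedged sketch (not the repo's nist2wav.sh): rewrite NIST SPHERE files
# as standard RIFF WAV files via the `soundfile` package (wraps libsndfile).
import glob
import os

import soundfile as sf

def nist_to_wav(src_dir, dst_dir):
    """Convert every NIST SPHERE file in src_dir to a plain WAV in dst_dir."""
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    # TIMIT ships NIST SPHERE data under a .wav extension
    for path in glob.glob(os.path.join(src_dir, '*.wav')):
        data, sample_rate = sf.read(path)  # libsndfile detects the NIST header
        out_path = os.path.join(dst_dir, os.path.basename(path))
        sf.write(out_path, data, sample_rate, subtype='PCM_16')
```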

Feature Extraction

MFCC is used to extract features from the raw sound waveform. The features are calculated using the code here.
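
For illustration, here is a hedged sketch of MFCC extraction using the python_speech_features package; it may differ from the exact feature code linked above, and extract_features is a hypothetical helper name.

```python
# A minimal sketch of MFCC extraction, assuming the python_speech_features
# package (this is an assumption, not necessarily the code linked above).
import numpy as np
import soundfile as sf
from python_speech_features import mfcc, delta

def extract_features(wav_path, num_cepstra=13):
    """Return per-frame MFCCs with first- and second-order deltas."""
    signal, sample_rate = sf.read(wav_path)
    # 13 cepstra per 25 ms window with a 10 ms step (library defaults)
    feats = mfcc(signal, samplerate=sample_rate, numcep=num_cepstra)
    d1 = delta(feats, 2)   # first-order (velocity) features
    d2 = delta(d1, 2)      # second-order (acceleration) features
    return np.hstack([feats, d1, d2])  # shape: [num_frames, 39]
```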

Model

A 4-layer bidirectional GRU is used as the acoustic model, and CTC is used to compute the loss and backpropagate the gradient through the network layers. Dropout and gradient clipping are applied to prevent overfitting and exploding gradients, respectively.
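
The following is a hedged sketch of that architecture in TF 1.x, not the repo's exact code; the hidden size, class count, learning rate, and clipping norm are assumed values for illustration.

```python
# A sketch of the described model: stacked bidirectional GRU layers with
# dropout, CTC loss, and global-norm gradient clipping (TF 1.x APIs).
import tensorflow as tf

num_layers, num_hidden, num_classes = 4, 256, 62  # 61 phonemes + CTC blank (assumed sizes)

inputs = tf.placeholder(tf.float32, [None, None, 39])  # [batch, time, features]
seq_len = tf.placeholder(tf.int32, [None])             # frames per utterance
labels = tf.sparse_placeholder(tf.int32)               # phoneme targets for CTC
keep_prob = tf.placeholder(tf.float32)                 # dropout keep probability

outputs = inputs
for i in range(num_layers):
    with tf.variable_scope('bigru_%d' % i):
        fw = tf.contrib.rnn.DropoutWrapper(
            tf.contrib.rnn.GRUCell(num_hidden), output_keep_prob=keep_prob)
        bw = tf.contrib.rnn.DropoutWrapper(
            tf.contrib.rnn.GRUCell(num_hidden), output_keep_prob=keep_prob)
        (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
            fw, bw, outputs, sequence_length=seq_len, dtype=tf.float32)
        outputs = tf.concat([out_fw, out_bw], axis=2)  # concat both directions

logits = tf.layers.dense(outputs, num_classes)  # per-frame class scores
logits = tf.transpose(logits, [1, 0, 2])        # CTC expects time-major input
loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_len))

# Clip the global gradient norm to avoid exploding gradients
optimizer = tf.train.AdamOptimizer(1e-4)
grads, tvars = zip(*optimizer.compute_gradients(loss))
grads, _ = tf.clip_by_global_norm(grads, 5.0)
train_op = optimizer.apply_gradients(zip(grads, tvars))
```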

PER

A PER calculation wrapper around Levenshtein edit distance is implemented (code), so based on this distance we can compute PER directly, without building a TensorFlow sub-graph. Specifically, as suggested in Speaker-Independent Phone Recognition Using Hidden Markov Models, the original 61 phonemes are merged into 39 classes to obtain more robust predictions.
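
For reference, here is a minimal sketch of PER via Levenshtein edit distance over phoneme sequences; the repo's own wrapper (linked above) may differ in detail, and edit_distance and per are illustrative names.

```python
# A sketch of PER computation: plain dynamic-programming Levenshtein
# distance, normalized by the reference length.
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))  # distances against the empty reference
    for i, r in enumerate(ref, 1):
        prev_diag, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,                # deletion
                dp[j - 1] + 1,            # insertion
                prev_diag + (r != h))     # substitution (free on a match)
    return dp[-1]

def per(ref, hyp):
    """Phone error rate; both sequences should already use the 39-class set."""
    return float(edit_distance(ref, hyp)) / len(ref)

# e.g. per(['sil', 'ae', 't'], ['sil', 'eh', 't']) -> 0.333...
```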

The figure below was generated with TensorBoard during the training phase.