
End to End Speech Recognition with Tensorflow


Speech Recognition 🗣📝

End-to-end speech recognition implemented with the deep learning framework TensorFlow. Built upon Recurrent Neural Networks with LSTM and CTC (Connectionist Temporal Classification).

🔨 Install

After cloning the repository, install the project and all its dependencies.

$ python setup.py install

🏄‍ Run

Run it from the command line, where you choose either the training or the prediction phase.

Training 💪

The command for running the training phase.

$ python -m speechrecognition train -config ./config/lstm_ctc.yml

You need to provide a configuration file for the training.

Prediction 🤔

The command for running the prediction phase.

$ python -m speechrecognition predict -audio {path/to/audio-file} -config ./config/lstm_ctc.yml

The same configuration file you provided in the training phase is also used in the prediction phase (sucks, I know). Most importantly, you provide the path to an audio file in WAV format, which will be transcribed to text.
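Since prediction expects a WAV file, it can help to check the file before passing it to `-audio`. The following is a minimal sketch using Python's standard `wave` module; the helper name `describe_wav` is hypothetical and not part of this project, and which sample rates or channel counts the model actually accepts is not stated here.

```python
import wave


def describe_wav(path):
    """Return basic properties of a WAV file.

    Raises wave.Error if the file is not a readable PCM WAV file.
    Hypothetical helper for illustration, not part of this repository.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

Calling `describe_wav("path/to/audio-file")` on a file that is not a WAV raises `wave.Error`, which is usually a clearer failure than an error deep inside feature extraction.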

Configuration File

The configuration file lets you define properties and sets the file paths to the dataset, the trained model and the TensorBoard logs.

The file is in YAML format and has the following predefined structure.

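| Section | Key | Modify |
| --- | --- | --- |
| dataset | name | ❗️ |
| | label_type | |
| | lang | |
| | dataset_path | ❗️ |
| feature | name | |
| | feature_size | |
| hyperparameter | num_classes | |
| | num_hidden | |
| | num_layers | |
| | batch_size | |
| | num_epoches | |
| | num_iterations | |
| | dropout_prob | |
| model | model_type | |
| | tensorboard_path | ❗️ |
| | trained_path | ❗️ |
| | model_description | |
| | restore_trained_model | |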
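As a quick sanity check, the sections and keys listed above can be verified on a parsed configuration before training starts. This is only a sketch: the nesting (each key sitting directly under its section) is inferred from the table, and `EXPECTED_KEYS` and `missing_keys` are hypothetical names that do not come from this project.

```python
# Sections and keys as listed in the configuration table.
# Assumption: each key is nested directly under its section in the YAML file.
EXPECTED_KEYS = {
    "dataset": ["name", "label_type", "lang", "dataset_path"],
    "feature": ["name", "feature_size"],
    "hyperparameter": [
        "num_classes", "num_hidden", "num_layers", "batch_size",
        "num_epoches", "num_iterations", "dropout_prob",
    ],
    "model": [
        "model_type", "tensorboard_path", "trained_path",
        "model_description", "restore_trained_model",
    ],
}


def missing_keys(config):
    """Return 'section.key' names that are absent from a parsed config dict."""
    missing = []
    for section, keys in EXPECTED_KEYS.items():
        values = config.get(section) or {}  # treat a missing/empty section as {}
        for key in keys:
            if key not in values:
                missing.append(f"{section}.{key}")
    return missing
```

The `config` dict would typically come from parsing the YAML file, for example with PyYAML's `yaml.safe_load`; whether this project uses PyYAML is an assumption.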

Dataset

It currently supports two speech datasets.

To train the model, you need to download a dataset yourself, store it locally, and set the path to the dataset in the configuration file.

The Learning Model

Preprocessing

MFCC

Model

RNN/BRNN -> Dense Layer -> CTC
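To illustrate what the CTC stage at the end of this pipeline produces, here is a minimal pure-Python sketch of greedy (best-path) CTC decoding: take the highest-scoring class per frame, collapse repeated classes, then drop blanks. It is not necessarily the decoder this project uses. The function name `ctc_greedy_decode` is hypothetical, and placing the blank at the last class index is an assumption.

```python
def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy (best-path) CTC decoding.

    frame_scores: one list of class scores per time frame, each of
                  length len(alphabet) + 1.
    alphabet:     string of output characters.
    Assumption: the CTC blank is the last class index, len(alphabet).
    """
    blank = len(alphabet)
    # Pick the highest-scoring class for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for index in best_path:
        # Keep a class only when it differs from the previous frame
        # and is not the blank symbol.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Frames: a, a, blank, a, b, b
scores = [
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
]
print(ctc_greedy_decode(scores, "ab"))  # → aab
```

The blank between the second and third `a` frames is what lets CTC emit the same character twice in a row; without it the repeated frames would collapse into a single `a`.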

Tensorboard

The path to the TensorBoard logs is defined in the configuration file. Run this command on that directory to monitor the progress of the training phase.

$ tensorboard --logdir {path/to/tensorboard-logs}

MI-PYT TODO:

  • Code Refactor
  • (TF dataset pipeline - GPU training speed up)
  • (Better Tensorboard monitoring)
  • (Divide to Train/Test set)
  • (Better Speech Evaluation)
  • Improved sound preprocessing and feature extraction
  • (Training model based on bidirectional RNNs)
  • Training model based on Attention Mechanism
  • Training model based on Neural Turing Machine
  • Automated generation of datasets from audiobooks
  • Documentation
  • Tests