End-to-end speech recognition implemented with the deep learning framework TensorFlow. Built upon recurrent neural networks with LSTM and CTC (Connectionist Temporal Classification).
After cloning the repository, install the project and its dependencies:
$ python setup.py install
Run it from the command line, where you choose either the training or the prediction phase.
The command for running the training phase:
$ python -m speechrecognition train -config ./config/lstm_ctc.yml
You need to provide a configuration file for the training.
The command for running the prediction phase:
$ python -m speechrecognition predict -audio {path/to/audio-file} -config ./config/lstm_ctc.yml
The same configuration file you provided in the training phase is also used in the prediction phase (sucks, I know). Most importantly, you provide the path to the audio file in WAV format, which will be transcribed to text.
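Since prediction expects a WAV file, it can help to inspect the file before passing it to `-audio`. The following is a minimal sketch using only Python's standard `wave` module. The helper name `describe_wav` is hypothetical and not part of this project, and the sketch makes no assumption about which sample rates or channel counts the model accepts.

```python
import wave


def describe_wav(path):
    """Return basic properties of a WAV file (hypothetical helper, not part of this repo)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / float(rate),
        }
```

Calling `describe_wav("path/to/audio-file.wav")` raises `wave.Error` if the file is not a readable PCM WAV file, which is a cheap way to catch a wrong format before running the prediction phase.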
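The configuration file lets you define the properties of a run and sets the file paths to the dataset, the trained model and the TensorBoard logs.
The file is in YAML format and has the following predefined structure. Keys marked ❗️ in the Modify column are the ones you need to adjust for your setup.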
| Section | Key | Modify |
|---|---|---|
| dataset | name | ❗️ |
| | label_type | |
| | lang | |
| | dataset_path | ❗️ |
| feature | name | |
| | feature_size | |
| hyperparameter | num_classes | |
| | num_hidden | |
| | num_layers | |
| | batch_size | |
| | num_epoches | |
| | num_iterations | |
| | dropout_prob | |
| model | model_type | |
| | tensorboard_path | ❗️ |
| | trained_path | ❗️ |
| | model_description | |
| | restore_trained_model | |
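A quick way to catch a malformed configuration before training is to compare it against the structure in the table. The sketch below assumes the YAML file has already been parsed into nested dictionaries (for example with PyYAML's `yaml.safe_load`). The names `EXPECTED_KEYS` and `missing_keys` are hypothetical and not part of this project; only the section and key names are taken from the table.

```python
# Section and key names copied from the table above; everything else is illustrative.
EXPECTED_KEYS = {
    "dataset": ["name", "label_type", "lang", "dataset_path"],
    "feature": ["name", "feature_size"],
    "hyperparameter": [
        "num_classes", "num_hidden", "num_layers", "batch_size",
        "num_epoches", "num_iterations", "dropout_prob",
    ],
    "model": [
        "model_type", "tensorboard_path", "trained_path",
        "model_description", "restore_trained_model",
    ],
}


def missing_keys(config):
    """Return 'section.key' entries that are absent from a parsed config dict."""
    missing = []
    for section, keys in EXPECTED_KEYS.items():
        present = config.get(section) or {}
        for key in keys:
            if key not in present:
                missing.append(f"{section}.{key}")
    return missing
```

An empty list means every expected key is present. It does not check that the values themselves (paths, sizes, types) are valid.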
Two speech datasets are currently supported:
- FreeSpokenDigits (1GB)
- VCTK Corpus (15GB)
To train the model, download a dataset yourself, store it locally, and change the dataset path in the configuration file accordingly.
Features: MFCC (Mel-frequency cepstral coefficients)
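MFCC features rest on the mel scale, which spaces frequency bands roughly the way human hearing perceives pitch. As an illustration only, the sketch below uses the commonly cited conversion formula to place filter centre frequencies evenly on that scale. It is not this project's feature extractor, and the function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (common 2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Centre frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]
```

In a full MFCC pipeline these centre frequencies define triangular filters applied to the power spectrum of each frame, followed by a logarithm and a discrete cosine transform.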
Model architecture: RNN/BRNN -> Dense Layer -> CTC
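The CTC stage at the end of this pipeline turns one label per audio frame into a shorter label sequence. A minimal sketch of greedy (best-path) CTC decoding is shown below, assuming the most likely label has already been picked for every frame. The function name is hypothetical, the blank symbol is passed in explicitly because its index depends on the framework configuration, and this is not necessarily the decoder the project uses.

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, blank):
    """Collapse consecutive repeats, then drop blanks (best-path CTC decoding)."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


# Repeated frames collapse to one symbol; a blank between two equal symbols keeps both.
decoded = ctc_greedy_decode(list("hh_e_ll_l_oo"), blank="_")
```

Here `decoded` is `['h', 'e', 'l', 'l', 'o']`: the blank between the two `l` runs is what allows a doubled letter to survive the collapsing step.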
The path to the TensorBoard logs is defined in the configuration file. Run the following command on that directory to monitor the progress of the training phase.
$ tensorboard --logdir {path/to/tensorboard-logs}
MI-PYT TODO:
- Code Refactor
- (TF dataset pipeline - GPU training speed up)
- (Better Tensorboard monitoring)
- (Divide to Train/Test set)
- (Better Speech Evaluation)
- Improved sound preprocessing and feature extraction
- (Training model based on bidirectional RNNs)
- Training model based on Attention Mechanism
- Training model based on Neural Turing Machine
- Automated generation of datasets from audiobooks
- Documentation
- Tests