Speech Recognition 🗣📝

End-to-end speech recognition implemented with the deep learning framework TensorFlow. It is built on Recurrent Neural Networks with LSTM and CTC (Connectionist Temporal Classification).

🔨 Install

After cloning the repository, install the project and all of its dependencies.

$ python setup.py install

🏄‍ Run

Run it from the command line, where you choose either the training or the prediction phase.

Training 💪

The command for running the training phase.

$ python -m speechrecognition train -config ./config/lstm_ctc.yml

You need to provide a configuration file for the training run.

Prediction 🤔

The command for running the prediction phase.

$ python -m speechrecognition predict -audio {path/to/audio-file} -config ./config/lstm_ctc.yml

The same configuration file you provided in the training phase is also used in the prediction phase (sucks, I know). Most importantly, you provide the path to an audio file in WAV format, which will be transcribed to text.
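
Since prediction expects a WAV file, it can help to inspect the file before passing it to the `predict` command. The following is a minimal sketch using only Python's standard `wave` module. The helper name `describe_wav` is hypothetical and not part of this project, and the check only covers uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file.

    Raises wave.Error if the file is not a WAV file the module can read.
    Hypothetical helper for illustration; not part of speechrecognition.
    """
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

If `describe_wav` raises `wave.Error`, the file is probably not a plain WAV file and may need converting first.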

Configuration File

The configuration file lets you define properties and sets the file paths to the dataset, the trained model and the TensorBoard logs.

The file is in YAML format and has the following predefined structure. Keys marked with ❗️ in the Modify column are the ones you need to adapt to your setup.

Section	Key	Modify
dataset	name	❗️
	label_type
	lang
	dataset_path	❗️
feature	name
	feature_size
hyperparameter	num_classes
	num_hidden
	num_layers
	batch_size
	num_epoches
	num_iterations
	dropout_prob
model	model_type
	tensorboard_path	❗️
	trained_path	❗️
	model_description
	restore_trained_model
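
To make the structure above concrete, here is a minimal sketch that checks a loaded configuration for the sections and keys listed in the table. It assumes the YAML file has already been parsed into nested dictionaries (for example with PyYAML's `yaml.safe_load`, which is an assumption about tooling, not something this README states). The names `EXPECTED_KEYS` and `missing_keys` are hypothetical and not part of the project.

```python
# Sections and keys copied from the table above.
EXPECTED_KEYS = {
    "dataset": ["name", "label_type", "lang", "dataset_path"],
    "feature": ["name", "feature_size"],
    "hyperparameter": [
        "num_classes", "num_hidden", "num_layers", "batch_size",
        "num_epoches", "num_iterations", "dropout_prob",
    ],
    "model": [
        "model_type", "tensorboard_path", "trained_path",
        "model_description", "restore_trained_model",
    ],
}


def missing_keys(config):
    """Return 'section.key' for every expected entry absent from config.

    Hypothetical helper for illustration; config is a dict of dicts.
    """
    missing = []
    for section, keys in EXPECTED_KEYS.items():
        # A missing or empty section counts as having no keys at all.
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing
```

An empty result means every key from the table is present. It does not check that the values themselves (paths, sizes) are valid.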

Dataset

Two speech datasets are currently supported.

FreeSpokenDigits (1GB)
VCTK Corpus (15GB)

To train the model, you need to download the dataset yourself, store it locally, and change the paths to the dataset in the configuration file.

The Learning Model

Preprocessing

MFCC

Model

RNN/BRNN -> Dense Layer -> CTC
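
The CTC stage at the end of this pipeline maps per-frame class scores to a shorter label sequence. As an illustration of the idea, here is a minimal sketch of greedy (best-path) CTC decoding in plain Python: take the best class per frame, merge consecutive repeats, then drop the blank symbol. This is not the project's decoder. The function name, the toy alphabet, and placing the blank at the last index are assumptions made for the example.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank_index):
    """Decode per-frame class scores with CTC best-path decoding.

    frame_scores: list of per-frame score lists (one score per class).
    alphabet: indexable mapping from class index to character.
    blank_index: index of the CTC blank class (assumed, see lead-in).
    """
    # 1. Pick the highest-scoring class for every time frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for index in best_path:
        # 2. Skip consecutive repeats; 3. skip the blank symbol.
        if index != previous and index != blank_index:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Toy example: classes are 'a', 'b' and blank (index 2).
scores = [
    [0.9, 0.05, 0.05],  # a
    [0.8, 0.1, 0.1],    # a (repeat, merged)
    [0.1, 0.1, 0.8],    # blank (separates the two a's)
    [0.7, 0.2, 0.1],    # a
    [0.1, 0.8, 0.1],    # b
    [0.2, 0.7, 0.1],    # b (repeat, merged)
]
print(ctc_greedy_decode(scores, "ab", blank_index=2))  # prints "aab"
```

The blank frame is what lets the decoder output the same character twice in a row. Without it, the two runs of `a` would collapse into one.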

Tensorboard

The path to the TensorBoard logs is defined in the configuration file. By running this command on that directory, you can follow the progress of the training phase.

$ tensorboard --logdir {path/to/tensorboard-logs}

MI-PYT TODO:

zvadaadam/speech-recognition