RNN-Transducer Speech Recognition
End-to-end speech recognition using RNN-Transducer in TensorFlow 2.0
Overview
This speech recognition model is based on Google's research paper "Streaming End-to-end Speech Recognition For Mobile Devices" and is implemented in Python 3 using TensorFlow 2.0.
NOTE: If you are not training using Docker, you must run the following commands and set up the loss function (instructions for this can be found in warp-transducer/tensorflow_binding).
Setup Your Environment
To set up your environment, run the following commands:
git clone --recurse-submodules https://github.com/noahchalifour/rnnt-speech-recognition.git
cd rnnt-speech-recognition
pip install tensorflow==2.1.0 # or tensorflow-gpu==2.1.0 for GPU support
pip install -r requirements.txt
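After installing, it can help to confirm which TensorFlow build the environment actually picked up. The following is a minimal sketch, not part of this repository: the helper names `installed_version` and `check_tensorflow` are hypothetical, and the check assumes Python 3.8 or newer so that `importlib.metadata` is available.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_tensorflow(expected: str = "2.1.0") -> str:
    """Describe whether a TensorFlow build matching `expected` is installed."""
    # Either the CPU or the GPU distribution satisfies the setup step above.
    for name in ("tensorflow", "tensorflow-gpu"):
        found = installed_version(name)
        if found is not None:
            status = "OK" if found == expected else "expected " + expected
            return f"{name} {found} ({status})"
    return "TensorFlow is not installed"


if __name__ == "__main__":
    print(check_tensorflow())
```

The check reads package metadata only, so it does not import TensorFlow itself and runs quickly even on machines without a GPU.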
Common Voice
The Common Voice dataset is available for download from Mozilla's Common Voice project.
Convert all MP3s to WAVs
Before you can train a model on the Common Voice dataset, you must first convert all of the MP3 audio files to WAV.
NOTE: Make sure you have ffmpeg installed on your computer, as the conversion script uses it to convert MP3 to WAV.
Do so by running the following commands:
./scripts/common_voice_convert.sh <data_dir>
python scripts/remove_missing_samples.py \
--data_dir <data_dir> \
--replace_old
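For readers who want to see what an MP3-to-WAV pass with ffmpeg can look like, here is a rough Python sketch. It is an illustration only and is not the contents of scripts/common_voice_convert.sh; the function names `build_ffmpeg_command` and `convert_directory` are hypothetical, and the sketch assumes each WAV should be written next to its source MP3 with ffmpeg's default output settings.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(mp3_path: Path) -> List[str]:
    """Build an ffmpeg command that writes a WAV next to the given MP3."""
    wav_path = mp3_path.with_suffix(".wav")
    # -y overwrites an existing output file; -i names the input file.
    return ["ffmpeg", "-y", "-i", str(mp3_path), str(wav_path)]


def convert_directory(data_dir: Path) -> int:
    """Convert every MP3 under data_dir to WAV and return how many were converted."""
    # Fail early with a clear message if ffmpeg is not installed.
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    converted = 0
    for mp3_path in sorted(data_dir.rglob("*.mp3")):
        subprocess.run(
            build_ffmpeg_command(mp3_path),
            check=True,  # raise CalledProcessError if ffmpeg fails on a file
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        converted += 1
    return converted
```

Calling `convert_directory(Path("<data_dir>"))` would walk the directory recursively. Because `check=True` stops at the first file ffmpeg cannot decode, a real script may prefer to log and skip such files instead.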
Preprocessing dataset
After converting all the MP3s to WAVs, you need to preprocess the dataset. You can do so by running the following command:
python preprocess_common_voice.py \
--data_dir <data_dir> \
--output_dir <preprocessed_dir>
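If preprocessing fails on particular files, one quick way to check whether a converted file is a readable WAV is to inspect its header with Python's standard `wave` module. This is a small sketch under stated assumptions, not something the preprocessing script is known to do: `wav_info` is a hypothetical helper, and it only handles uncompressed PCM WAV files, which is what the `wave` module supports.

```python
import wave
from pathlib import Path
from typing import Tuple


def wav_info(path: Path) -> Tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    # wave.open raises wave.Error if the file is not a valid PCM WAV.
    with wave.open(str(path), "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / float(sample_rate)
    return sample_rate, channels, duration
```

Running this over a handful of files from the data directory gives a quick view of the sample rates and clip lengths that the preprocessing step will receive.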
Training a model
To train a simple model, run the following command:
python run_common_voice.py \
--mode train \
--data_dir <path to common voice directory>
Pretrained Model
Due to financial constraints, I have not been able to train a high-quality model. If anybody is willing to train a model, you can send it to me and I will put it up here and give you credit (chalifournoah@gmail.com).