RNN-T-ASR

This repository contains code for building an RNN transducer (RNN-T) model for automatic speech recognition (ASR) [1]. LSTM- and Conformer-based encoders are currently supported.

Requirements

torch >= 1.11.0
torchaudio >= 0.11.0
speechbrain >= 0.5.11
pandas
tqdm
transformers >= 4.18.0 URL (not needed for plain ASR training)
conformer >= 0.2.5 URL

Training

Non-SLURM-based (for debugging)

run_debug.sh is the script for debugging, usually done on a single node. In this script:

--batch-size is the total (effective) batch size: one gradient descent update is made after this many examples have been processed.
--bsz-small is the batch size per GPU. If the combined batch size across all GPUs (number of GPUs * --bsz-small) does not equal --batch-size, gradients are accumulated.
--save-path where to save checkpoints. A checkpoint is saved after every epoch by default; edit --checkpoint-after to change this.
--ckpt-path path to checkpoint to be loaded to continue training.
--train-path path where the training file lives. It should be a csv which follows a template defined at URL.
--enc-type 'lstm' OR 'conf'.
--hid-tr hidden units in the transcription network.
--hid-pr hidden units in the prediction network.
--unidirectional set this flag if training a unidirectional LSTM as the transcription network. Useful for streaming ASR.
--dont-fix-path set this flag if your csv contains absolute paths to the audio files. Otherwise, leave it unset and edit the fix() function in data.py so that it produces the correct paths.
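
The relationship between --batch-size, --bsz-small and the number of GPUs can be sketched as below. This is only an illustration of the arithmetic, not the code in main.py: the function name accumulation_steps is hypothetical, and rejecting batch sizes that do not divide evenly is an assumption of this sketch.

```python
def accumulation_steps(batch_size: int, bsz_small: int, num_gpus: int) -> int:
    """Return how many forward/backward passes precede one optimizer step.

    batch_size corresponds to --batch-size, bsz_small to --bsz-small.
    """
    # Examples processed across all GPUs in a single forward/backward pass.
    per_pass = bsz_small * num_gpus
    if per_pass <= 0:
        raise ValueError("bsz_small and num_gpus must be positive")
    # Assumption of this sketch: the total must be a multiple of per_pass.
    if batch_size % per_pass != 0:
        raise ValueError("batch_size must be a multiple of bsz_small * num_gpus")
    return batch_size // per_pass


# 4 GPUs with 8 examples each give 32 per pass, so 64 needs 2 passes.
print(accumulation_steps(batch_size=64, bsz_small=8, num_gpus=4))
```

When the result is 1, every pass is followed by an update and no accumulation takes place.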

SLURM-based (multiple nodes/GPUs)

Submit the job with sbatch job_submit.sh.

--nnodes number of nodes to request.
--gpus number of gpus per node.

The sync folder is required for distributed training (DDP), as we use a shared file system to synchronize training. Always DELETE sync/shared BEFORE STARTING A NEW DDP INSTANCE; otherwise training will not start.
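
The cleanup step can be scripted so that it is not forgotten. The helper below is a minimal sketch and is not part of the repository: the name clear_ddp_sync is hypothetical, and since it is not stated here whether sync/shared is a file or a directory, both cases are handled.

```python
import shutil
from pathlib import Path


def clear_ddp_sync(sync_dir: str = "sync") -> bool:
    """Delete <sync_dir>/shared if present; return True if something was removed."""
    shared = Path(sync_dir) / "shared"
    if shared.is_dir():
        # Directory case: remove it together with its contents.
        shutil.rmtree(shared)
        return True
    if shared.exists():
        # Plain-file case.
        shared.unlink()
        return True
    # Nothing to clean up.
    return False
```

Calling clear_ddp_sync() from the repository root before submitting a new job leaves the sync folder itself in place and removes only the stale shared entry.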

Decoding

We use a beam search variant proposed in [2].

Run mkdir asr_log in the current directory if running for the first time.
sbatch run_asr.sh runs decoding on 100 nodes in parallel, with each node decoding 1/100 of the test set.
bash run_decode.sh is the single-node variant of the above, which can be used for debugging.

In the above scripts:

--test-path is the folder containing 100 csv files numbered {0..99}.csv in the same format as URL.
--decode-path where to write the decodes. It should be a folder (created if it does not exist).
--unidirectional set this flag if the model was trained with a unidirectional LSTM as the transcription network.
--dont-fix-path set this flag if your csv contains absolute paths to the audio files. Otherwise, leave it unset.
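
If the test set is a single csv, the 100 files expected by --test-path could be produced along these lines. This is a sketch under assumptions, not a script shipped with the repository: the name shard_csv is hypothetical, the csv is assumed to have one header row that every shard should repeat, and rows are distributed round-robin.

```python
import csv
from pathlib import Path


def shard_csv(src: str, out_dir: str, num_shards: int = 100) -> None:
    """Split src into out_dir/0.csv ... out_dir/<num_shards - 1>.csv."""
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)  # assumed: first row is the header
        if header is None:
            raise ValueError("source csv is empty")
        rows = list(reader)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(num_shards):
        with open(out / f"{i}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            # Round-robin: shard i receives rows i, i + num_shards, ...
            writer.writerows(rows[i::num_shards])
```

Round-robin assignment keeps the shards within one row of each other in size, so the parallel decoding jobs receive roughly equal numbers of utterances.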

Other hyperparameters in the training and decoding scripts are self-explanatory. See the help strings in the argument definitions in main.py and decode.py for more details.

Scoring

In compute_wer.sh, change PTH to the path for the folder containing the decodes (see above).
Run bash compute_wer.sh.

The Word Error Rate (WER) is computed and appended to the file ${PTH}/full.txt, which also contains "ground truth ----> hypothesis" for every utterance in the test set.
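
For reference, WER is the word-level edit distance between reference and hypothesis, summed over all utterances and divided by the total number of reference words. The sketch below recomputes it from lines in the "ground truth ----> hypothesis" form. It is not the code behind compute_wer.sh: the function names are hypothetical, and it assumes the separator is the literal string "---->" and that lines without it (such as the final WER line) can be skipped.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(lines, sep="---->"):
    """WER over lines shaped like 'ground truth ----> hypothesis'."""
    errors = words = 0
    for line in lines:
        if sep not in line:
            continue  # assumed: not an utterance line
        ref, hyp = (part.split() for part in line.split(sep, 1))
        errors += edit_distance(ref, hyp)
        words += len(ref)
    if words == 0:
        raise ValueError("no reference words found")
    return errors / words


# 0 errors in the first pair, 2 in the second, over 7 reference words.
print(corpus_wer(["the cat sat ----> the cat sat", "a b c d ----> a x c"]))
```

Pooling errors over the whole test set, rather than averaging per-utterance rates, keeps short utterances from being overweighted.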

References

[1] Alex Graves, "Sequence transduction with recurrent neural networks," Representation Learning Workshop, ICML 2012.
[2] George Saon, Zoltán Tüske and Kartik Audhkhasi, "Alignment-length synchronous decoding for RNN transducer," ICASSP 2020.

OSU-slatelab/RNN-T-ASR