RNN-T-ASR

This repository contains code for building an RNN transducer (RNN-T) model for automatic speech recognition (ASR) [1]. LSTM-based and Conformer-based ASR are currently supported.

Requirements

  • torch >= 1.11.0
  • torchaudio >= 0.11.0
  • speechbrain >= 0.5.11
  • pandas
  • tqdm
  • transformers >= 4.18.0 URL (not needed for plain ASR training)
  • conformer >= 0.2.5 URL

Training

Non SLURM based (for debugging)

run_debug.sh is the script for debugging, usually done on a single node. In this script:

  • --batch-size is the effective batch size: one gradient update is made after this many examples have been processed.
  • --bsz-small is the batch size per GPU. If the total across all GPUs (number of GPUs × --bsz-small) does not equal --batch-size, gradients are accumulated across steps until --batch-size examples have been seen.
  • --save-path is where checkpoints are saved (after every epoch by default; edit --checkpoint-after to change this).
  • --ckpt-path path to checkpoint to be loaded to continue training.
  • --train-path path where the training file lives. It should be a csv which follows a template defined at URL.
  • --enc-type 'lstm' OR 'conf'.
  • --hid-tr hidden units in the transcription network.
  • --hid-pr hidden units in the prediction network.
  • --unidirectional set this flag if training a unidirectional LSTM as the transcription network. Useful for streaming ASR.
  • --dont-fix-path set this flag if your csv contains the absolute path to the audio. Otherwise, leave it unset and edit the fix() function in data.py accordingly.
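
The relationship between --batch-size, --bsz-small and the number of GPUs can be sketched as below. This is an illustration only, not the code in main.py: the function name accumulation_steps is hypothetical, and the divisibility check is an assumption about how the flags are meant to be combined.

```python
def accumulation_steps(batch_size: int, bsz_small: int, n_gpus: int) -> int:
    """Forward/backward passes per optimizer update (assumed rule, not taken from main.py)."""
    per_step = bsz_small * n_gpus  # examples processed across all GPUs in one pass
    if per_step <= 0:
        raise ValueError("bsz_small and n_gpus must be positive")
    if batch_size % per_step != 0:
        # Assumption: the effective batch size should be a multiple of the per-step total.
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of n_gpus*bsz_small={per_step}"
        )
    return batch_size // per_step


# 4 GPUs with 8 utterances each give 32 per pass, so 2 passes reach a batch of 64.
print(accumulation_steps(batch_size=64, bsz_small=8, n_gpus=4))  # 2
```

With a result of 1, no accumulation is needed and every pass ends in an update.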

SLURM based (multiple nodes/gpus)

Submit the job with sbatch job_submit.sh. In this script:

  • --nnodes number of nodes to request.
  • --gpus number of gpus per node.

The folder sync is required for distributed training (DDP) because a shared file system is used to synchronize the training processes. Always delete sync/shared before starting a new DDP instance; otherwise training will not start.
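
The cleanup step can be scripted so that it is not forgotten. The helper below is a minimal sketch, assuming that sync/shared is a plain file or directory relative to the working directory; the name clear_sync_file is hypothetical and not part of this repository.

```python
import shutil
from pathlib import Path


def clear_sync_file(sync_dir: str = "sync", name: str = "shared") -> bool:
    """Remove a stale rendezvous entry; return True if something was deleted."""
    target = Path(sync_dir) / name
    if target.is_dir():
        shutil.rmtree(target)  # stale entry is a directory
        removed = True
    elif target.exists():
        target.unlink()  # stale entry is a regular file
        removed = True
    else:
        removed = False
    # Keep the sync folder itself, since DDP start-up expects it to exist.
    Path(sync_dir).mkdir(parents=True, exist_ok=True)
    return removed
```

Calling clear_sync_file() from the repository root before sbatch job_submit.sh has the same effect as deleting sync/shared by hand.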

Decoding

We use a beam search variant proposed in [2].

  • mkdir asr_log in the current path if running for the first time.
  • sbatch run_asr.sh runs the decoding on 100 parallel nodes, with each node decoding 1/100 of the test set.
  • bash run_decode.sh is the single node variant of the above which can be used for debugging.

In the above scripts:

  • --test-path is the folder containing 100 csv files numbered {0..99}.csv in the same format as URL.
  • --decode-path where to write the decodes. It should be a folder (it will be created if it does not exist).
  • --unidirectional set this flag if the model was trained with a unidirectional LSTM as the transcription network.
  • --dont-fix-path set this flag if your csv contains the absolute path to the audio. Otherwise, leave it unset.
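
One way to produce the 100 files expected by --test-path is to split a single test csv round-robin. The sketch below uses only the standard library and assumes the csv has a header row that each shard should repeat; the function name split_csv and the round-robin assignment are illustrative choices, not something the repository prescribes.

```python
import csv
from pathlib import Path
from typing import List


def split_csv(test_csv: str, out_dir: str, n_shards: int = 100) -> List[int]:
    """Write 0.csv .. (n_shards-1).csv into out_dir; return the row count per shard."""
    with open(test_csv, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumption: the first row is a header
        rows = list(reader)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    counts = []
    for i in range(n_shards):
        shard = rows[i::n_shards]  # every n_shards-th row, starting at row i
        with open(out / f"{i}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(shard)
        counts.append(len(shard))
    return counts
```

Round-robin assignment keeps shard sizes within one row of each other, so the parallel decoding jobs receive a similar number of utterances.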

Other hyperparameters in the training and decoding scripts are self-explanatory. See the --help text in the argument definitions in main.py and decode.py for more details.

Scoring

In compute_wer.sh, change PTH to the path for the folder containing the decodes (see above).
Run bash compute_wer.sh.

The Word Error Rate (WER) is computed and written to the end of the file ${PTH}/full.txt, which also contains "ground truth ----> hypothesis" for every utterance in the test set.
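
For reference, WER is the word-level edit distance between reference and hypothesis, summed over the test set and divided by the total number of reference words. The sketch below shows that standard definition; it is not the implementation used by compute_wer.sh, and parse_line assumes that each line of full.txt uses the literal separator "---->".

```python
from typing import Iterable, Tuple


def parse_line(line: str) -> Tuple[str, str]:
    """Split 'ground truth ----> hypothesis' (the separator format is an assumption)."""
    ref, hyp = line.split("---->", 1)
    return ref.strip(), hyp.strip()


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,  # deletion
                curr[j - 1] + 1,  # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total errors divided by total reference words over all utterances."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words if words else 0.0


lines = ["the cat sat ----> the cat sat", "hello world ----> hello word"]
print(f"{corpus_wer(parse_line(l) for l in lines):.2%}")  # 20.00%
```

Summing errors over the corpus before dividing weights long utterances more heavily than averaging per-utterance WER, which is the usual convention for reporting ASR results.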

References

[1] Alex Graves, "Sequence transduction with recurrent neural networks", Representation Learning Workshop, ICML 2012.
[2] George Saon, Zoltán Tüske and Kartik Audhkhasi, "Alignment-length synchronous decoding for RNN transducer", ICASSP 2020.