End-to-End Speech Recognition using RNN-Transducer
File description
- eval.py: decoding with the RNNT joint model
- model.py: RNNT model, containing the acoustic model and the phoneme model
- model2012.py: RNNT model following Graves 2012
- seq2seq/*: seq2seq with attention
- rnnt_np.py: RNNT loss function implemented for MXNet, supporting both the Symbol and Gluon APIs; based on the PyTorch implementation
- DataLoader.py: data processing
- train.py: RNNT training script; the model can be initialized from CTC and PM models
- train_ctc.py: CTC training script
- train_att.py: attention training script
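To illustrate what an RNNT loss such as the one in rnnt_np.py has to compute, here is a minimal NumPy sketch of the forward (alpha) recursion from Graves 2012 for a single utterance. It is not the code from this repository: the function names, the `(T, U + 1, V)` input layout and the choice of `blank=0` are assumptions made for the example.

```python
import numpy as np

def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b)) for two scalars."""
    if a == -np.inf:
        return b
    if b == -np.inf:
        return a
    m = max(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def rnnt_neg_log_likelihood(log_probs, labels, blank=0):
    """Forward recursion over the (t, u) lattice for one utterance.

    log_probs: array of shape (T, U + 1, V), log-softmax of the joint output
               (assumed layout for this sketch).
    labels:    sequence of U target label ids, without blanks.
    Returns -log P(labels | input).
    """
    T, U1, _ = log_probs.shape
    assert U1 == len(labels) + 1
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            stay = -np.inf  # reach (t, u) from (t - 1, u) by emitting blank
            emit = -np.inf  # reach (t, u) from (t, u - 1) by emitting a label
            if t > 0:
                stay = alpha[t - 1, u] + log_probs[t - 1, u, blank]
            if u > 0:
                emit = alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
            alpha[t, u] = logsumexp2(stay, emit)
    # Every alignment ends with a blank emitted from the last lattice node.
    return -(alpha[T - 1, U1 - 1] + log_probs[T - 1, U1 - 1, blank])

# Uniform toy distribution: T = 2 frames, U = 1 label, V = 3 symbols.
uniform = np.log(np.full((2, 2, 3), 1.0 / 3.0))
print(rnnt_neg_log_likelihood(uniform, [1]))
```

With a uniform distribution there are two valid alignments of three symbols each, so the printed value equals `-log(2 / 27)`. A practical implementation would also need the backward pass for gradients and batching over utterances of different lengths.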
Directory description
- conf: Kaldi feature extraction config
Reference Paper
- RNN Transducer (Graves 2012): Sequence Transduction with Recurrent Neural Networks
- RNNT joint (Graves 2013): Speech Recognition with Deep Recurrent Neural Networks
- E2E criterion comparison (Baidu 2017): Exploring Neural Transducers for End-to-End Speech Recognition
- Seq2Seq-Attention: Attention-Based Models for Speech Recognition
Run
- Compile RNNT Loss: follow the instructions here to compile MXNet with the RNNT loss.
- Extract features: link the Kaldi TIMIT example dirs (local, steps, utils), execute run.sh to extract 40-dim fbank features, then run feature_transform.sh to get the 123-dim features described in Graves 2013.
- Train RNNT model:
python train.py --lr 1e-3 --bi --dropout .5 --out exp/rnnt_bi_lr1e-3 --schedule
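On the 123-dim feature from the feature extraction step: in Graves 2013 the input is 40 filter-bank coefficients plus energy (41 static values) together with their first and second temporal derivatives, giving 3 x 41 = 123. The sketch below only illustrates that arithmetic with NumPy. It is not what feature_transform.sh does; `simple_delta` and `add_deltas` are hypothetical helpers, and the central difference used here is simpler than the regression-window deltas that Kaldi computes.

```python
import numpy as np

def simple_delta(feats):
    """Central-difference derivative along time with edge padding.

    feats: array of shape (T, D). Returns an array of shape (T, D).
    """
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def add_deltas(static):
    """Stack static, delta and delta-delta features: (T, D) -> (T, 3 * D)."""
    d1 = simple_delta(static)
    d2 = simple_delta(d1)
    return np.concatenate([static, d1, d2], axis=1)

# Hypothetical input: 100 frames of 40 fbank coefficients plus 1 energy term.
static = np.random.RandomState(0).randn(100, 41)
feats = add_deltas(static)
print(feats.shape)  # (100, 123)
```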
Evaluation
By default, evaluation is supported only for the RNNT model.
- Greedy decoding:
python eval.py <path to best model parameters> --bi
- Beam search:
python eval.py <path to best model parameters> --bi --beam <beam size>
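For readers unfamiliar with transducer decoding, the greedy mode can be sketched roughly as follows: at each frame, take the most likely symbol; a blank advances to the next frame, while any other symbol is emitted and the same frame is queried again. This is a minimal sketch, not the logic of eval.py. `joint_log_probs` is a hypothetical callable standing in for the acoustic model, phoneme model and joint network, and `max_symbols_per_frame` is an assumed safety cap.

```python
import numpy as np

def greedy_decode(joint_log_probs, num_frames, blank=0, max_symbols_per_frame=5):
    """Greedy transducer decoding over num_frames acoustic frames.

    joint_log_probs: hypothetical callable (t, prefix) -> 1-D array of
        log-probabilities over the vocabulary, given frame index t and the
        labels emitted so far.
    """
    prefix = []
    for t in range(num_frames):
        for _ in range(max_symbols_per_frame):
            k = int(np.argmax(joint_log_probs(t, prefix)))
            if k == blank:
                break  # blank: move on to the next frame
            prefix.append(k)  # non-blank: emit and stay on the same frame
    return prefix

def toy_joint(t, prefix):
    """Toy scorer: emits label t + 1 once on frame t, then prefers blank."""
    scores = np.full(4, -10.0)
    if len(prefix) <= t:
        scores[t + 1] = 0.0
    else:
        scores[0] = 0.0
    return scores

print(greedy_decode(toy_joint, num_frames=3))  # [1, 2, 3]
```

Beam search keeps several prefixes alive instead of a single one, which is why it is slower and why the beam size is exposed as an option.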
Results
- CTC

| Decode   | PER (%) |
|----------|---------|
| greedy   | 20.36   |
| beam 100 | 20.03   |

- Transducer

| Decode   | PER (%) |
|----------|---------|
| greedy   | 20.74   |
| beam 40  | 19.84   |
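PER (phoneme error rate) is conventionally the total edit distance between hypothesis and reference phoneme sequences divided by the total number of reference phonemes. The sketch below shows that computation under this standard definition; it is not the scoring code behind the numbers above, and details such as phone-set mapping are not covered. `edit_distance` and `phoneme_error_rate` are hypothetical helper names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def phoneme_error_rate(refs, hyps):
    """PER in percent over paired lists of phoneme sequences."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total

refs = [["sil", "k", "ae", "t", "sil"]]
hyps = [["sil", "k", "t", "sil"]]
print(phoneme_error_rate(refs, hyps))  # 20.0
```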
Requirements
- Python 3.6
- MXNet 1.1.0
- numpy 1.14
TODO
- beam search acceleration
- Seq2Seq with attention