This repository contains implementations of end-to-end ASR systems based on LAS, CTC (w/o attention), and RNN-Transducer (w/o attention).
- torch >= 1.5.1
- torchtext >= 0.6.0
- torchaudio
- warp-rnnt
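The dependencies above can be installed along these lines (a sketch, assuming a CUDA-enabled machine; the warp-rnnt repository URL and `pytorch_binding` path refer to the upstream project and may change):

```shell
pip install "torch>=1.5.1" "torchtext>=0.6.0" torchaudio

# warp-rnnt is best built from source (see the acknowledgments below)
git clone https://github.com/1ytic/warp-rnnt
cd warp-rnnt/pytorch_binding
python setup.py install
```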
Model | train/dev loss | train/dev PER | Epoch |
---|---|---|---|
CTC | 0.64/1.03 | 0.20/0.315 | 178 |
Transducer | 12.0/- | -/0.2662 | 13 |
Pretrained Transducer | 0.7/- | -/0.2670 | 195 |
LAS | - | - | - |

Language model train/dev loss: 2.68/2.80, train/dev ppl: 14.5/16.49 (epoch 292).
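The language-model perplexities are just the exponential of the cross-entropy loss; the reported values (14.5/16.49) agree with the logged losses up to rounding:

```python
import math

# Perplexity is exp(cross-entropy loss), so each loss/ppl pair should match.
train_loss, dev_loss = 2.68, 2.80

print(round(math.exp(train_loss), 1))  # 14.6, vs. reported 14.5
print(round(math.exp(dev_loss), 2))    # 16.44, vs. reported 16.49
```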
- A smaller vocabulary (due to phoneme mapping [6]) improves performance.
- A VGG feature extractor [7] (ResNet is even better) helps the model converge faster.
- Transducer converges faster and generalizes better than CTC.
- Weight noise [8] is a useful regularizer for RNN/LSTM.
- Batch normalization helps the model converge faster.
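Weight noise is easy to add to a training loop: perturb the weights for the forward/backward pass, then restore them before the optimizer update. A minimal PyTorch sketch (the helper name, the `std=0.075` default, and the `loss_fn(model, batch)` signature are illustrative, not from this repo):

```python
import torch
import torch.nn as nn

def train_step_with_weight_noise(model, loss_fn, batch, std=0.075):
    """One training step with weight noise: add zero-mean Gaussian noise
    to every parameter, compute loss and gradients at the noisy weights,
    then restore the clean weights (optimizer.step() would follow)."""
    clean = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * std)
    loss = loss_fn(model, batch)
    loss.backward()  # gradients are taken at the noisy weights
    with torch.no_grad():
        for p, c in zip(model.parameters(), clean):
            p.copy_(c)  # restore the clean weights
    return loss
```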
- pretrained transducer
- LAS
- beam search
- hybrid
- add visualization script plot.py
- A Comparison of Sequence-to-Sequence Models for Speech Recognition [Ref]
- Deep Learning for Human Language Processing (2020,Spring) [Ref]
- Alexander-H-Liu/End-to-end-ASR-Pytorch [Ref]
- Open Source Korean End-to-end Automatic Speech Recognition [Ref]
- Language Translation With TorchText [Ref]
- End-to-end automatic speech recognition system implemented in TensorFlow [Ref]
- Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM [Ref]
- Speech Recognition with Deep Recurrent Neural Networks [Ref]
- pretrained embedding [Ref]
- Thanks to warp-rnnt, a PyTorch binding for the CUDA-Warp RNN-Transducer loss. Note that it is best installed from source.
- Thanks to warp-transducer, a more general implementation of the RNN transducer. Carefully set the environment variables as described there before running `python setup.py install`.