
End-to-end ASR/LM implementation with PyTorch



NeuralSP: Neural network based Speech Processing

How to install

# Set path to CUDA, NCCL
CUDAROOT=/usr/local/cuda
NCCL_ROOT=/usr/local/nccl

export CPATH=$NCCL_ROOT/include:$CPATH
export LD_LIBRARY_PATH=$NCCL_ROOT/lib/:$CUDAROOT/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=$NCCL_ROOT/lib/:$LIBRARY_PATH
export CUDA_HOME=$CUDAROOT
export CUDA_PATH=$CUDAROOT
export CPATH=$CUDA_PATH/include:$CPATH  # for warp-rnnt

# Install miniconda, python libraries, and other tools
cd tools
make KALDI=/path/to/kaldi
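
After the build finishes, a quick sanity check (not part of the repository) can confirm that the installed PyTorch sees the GPU:

# Minimal sanity check; assumes PyTorch was installed by the Makefile above.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # should print True on a GPU machine
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))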

Key features

Corpus

  • ASR

    • AISHELL-1
    • CSJ
    • Librispeech
    • Switchboard (+ Fisher)
    • TEDLIUM2/TEDLIUM3
    • TIMIT
    • WSJ
  • LM

    • Penn Tree Bank
    • WikiText2

Front-end

  • Frame stacking
  • Sequence summary network [link]
  • SpecAugment [link] (see the sketch below this list)
  • Adaptive SpecAugment [link]
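
As a rough illustration of the SpecAugment item above, the sketch below applies frequency and time masking to a log-mel feature matrix. The function name and mask widths are assumptions for illustration, not the repository's implementation.

import torch

def spec_augment(x, num_freq_masks=2, max_freq_width=27, num_time_masks=2, max_time_width=100):
    """Mask random frequency bands and time spans of a [T, F] log-mel feature matrix."""
    x = x.clone()
    T, F = x.shape
    for _ in range(num_freq_masks):
        f = int(torch.randint(0, max_freq_width + 1, (1,)))
        f0 = int(torch.randint(0, max(1, F - f), (1,)))
        x[:, f0:f0 + f] = 0.0   # zero out a random frequency band
    for _ in range(num_time_masks):
        t = int(torch.randint(0, max_time_width + 1, (1,)))
        t0 = int(torch.randint(0, max(1, T - t), (1,)))
        x[t0:t0 + t, :] = 0.0   # zero out a random time span
    return x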

Encoder

  • RNN encoder
    • (CNN-)BLSTM, (CNN-)LSTM, (CNN-)BLGRU, (CNN-)LGRU
    • Latency-controlled BLSTM [link]
  • Transformer encoder [link]
    • (CNN-)Transformer
    • Chunk hopping mechanism [link]
    • Relative positional encoding [link]
  • Time-depth separable (TDS) convolution encoder [link] [link]
  • Gated CNN encoder (GLU) [link]
  • Conformer encoder [link]

Connectionist Temporal Classification (CTC) decoder

  • Forced alignment
  • Beam search
  • Shallow fusion
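
Shallow fusion interpolates the ASR scores with an external LM's scores at each beam-search step; the function below is a minimal sketch of that scoring rule (the name lm_weight and the value 0.3 are assumptions, not the repository's defaults).

import torch

def shallow_fusion_score(asr_log_probs, lm_log_probs, lm_weight=0.3):
    # Both arguments are [vocab]-sized log-probability tensors for the next token;
    # the beam search ranks hypothesis extensions by this combined score.
    return asr_log_probs + lm_weight * lm_log_probs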

Attention-based decoder

  • RNN decoder
    • Shallow fusion
    • Cold fusion [link]
    • Deep fusion [link]
    • Forward-backward attention decoding [link]
    • Ensemble decoding
  • Streaming RNN decoder
    • Hard monotonic attention [link]
    • Monotonic chunkwise attention (MoChA) [link]
    • CTC-synchronous training (CTC-ST) [link]
  • RNN transducer [link]
  • Transformer decoder [link]
  • Streaming Transformer decoder
    • Monotonic Multihead Attention [link] [link]

Language model (LM)

  • RNNLM (recurrent neural network language model)
  • Gated convolutional LM [link]
  • Transformer LM
  • Transformer-XL LM [link]
  • Adaptive softmax [link]
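
For the adaptive softmax item above, a minimal sketch using PyTorch's built-in nn.AdaptiveLogSoftmaxWithLoss; the sizes and cutoffs below are assumed values, not the repository's configuration.

import torch
import torch.nn as nn

hidden_size, vocab_size = 1024, 100000           # assumed sizes
criterion = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_size,
    n_classes=vocab_size,
    cutoffs=[2000, 10000, 50000],                # frequency-ordered vocabulary buckets (assumed)
)

hidden = torch.randn(32, hidden_size)            # LM hidden states for 32 target tokens
targets = torch.randint(0, vocab_size, (32,))    # next-token ids over a frequency-sorted vocabulary
output, loss = criterion(hidden, targets)        # loss is the mean negative log-likelihood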

Output units

  • Phoneme
  • Grapheme
  • Wordpiece (BPE, sentencepiece); see the sketch after this list
  • Word
  • Word-char mix
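
For the wordpiece units above, subword models are typically built with sentencepiece; a minimal sketch follows, where the file names and vocabulary size are placeholders, not the recipes' actual settings.

import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="train_text.txt",     # placeholder path
    model_prefix="bpe10k",      # writes bpe10k.model and bpe10k.vocab
    vocab_size=10000,           # assumed size
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe10k.model")
print(sp.encode("hello world", out_type=str))    # list of wordpieces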

Multi-task learning (MTL)

Multi-task learning (MTL) with different output units is supported to alleviate data sparseness.

  • Hybrid CTC/attention [link] (see the sketch after this list)
  • Hierarchical Attention (e.g., word attention + character attention) [link]
  • Hierarchical CTC (e.g., word CTC + character CTC) [link]
  • Hierarchical CTC+Attention (e.g., word attention + character CTC) [link]
  • Forward-backward attention [link]
  • LM objective
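
As a sketch of the hybrid CTC/attention objective listed above, the two branch losses are interpolated with a tunable weight; the name ctc_weight and the value 0.3 are assumptions, not the repository's defaults.

def hybrid_ctc_attention_loss(loss_att, loss_ctc, ctc_weight=0.3):
    # ctc_weight = 0 trains the attention branch only; ctc_weight = 1 trains CTC only.
    return (1.0 - ctc_weight) * loss_att + ctc_weight * loss_ctc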

ASR Performance

AISHELL-1 (CER)

model          dev   test
Transformer    5.0   5.4
Conformer      4.7   5.2
Streaming MMA  5.5   6.1

CSJ (WER)

model  eval1  eval2  eval3
LAS    6.5    5.1    5.6

Switchboard 300h (WER)

model  SWB  CH
LAS    9.1  18.8

Switchboard+Fisher 2000h (WER)

model  SWB  CH
LAS    7.8  13.8

Librispeech (WER)

model          dev-clean  dev-other  test-clean  test-other
Transformer    2.1        5.3        2.4         5.7
Streaming MMA  2.5        6.9        2.7         7.1

TEDLIUM2 (WER)

model  dev   test
LAS    10.9  11.2

WSJ (WER)

model  test_dev93  test_eval92
LAS    8.8         6.2

LM Performance

Penn Tree Bank (PPL)

model        valid  test
RNNLM        87.99  86.06
+ cache=100  79.58  79.12
+ cache=500  77.36  76.94

WikiText2 (PPL)

model         valid   test
RNNLM         104.53  98.73
+ cache=100   90.86   85.87
+ cache=2000  76.10   72.77

Reference

Dependency