NeuralSP: Neural network based Speech Processing

How to install

# Set path to CUDA, NCCL
CUDAROOT=/usr/local/cuda
NCCL_ROOT=/usr/local/nccl

export CPATH=$NCCL_ROOT/include:$CPATH
export LD_LIBRARY_PATH=$NCCL_ROOT/lib/:$CUDAROOT/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=$NCCL_ROOT/lib/:$LIBRARY_PATH
export CUDA_HOME=$CUDAROOT
export CUDA_PATH=$CUDAROOT
export CPATH=$CUDA_PATH/include:$CPATH  # for warp-rnnt

# Install miniconda, python libraries, and other tools
cd tools
make KALDI=/path/to/kaldi

Key features

Corpus

ASR
- AISHELL-1
- CSJ
- Librispeech
- Switchboard (+ Fisher)
- TEDLIUM2/TEDLIUM3
- TIMIT
- WSJ
LM
- Penn Tree Bank
- WikiText2

Front-end

Frame stacking
Sequence summary network [link]
SpecAugment [link]
Adaptive SpecAugment [link]
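
Frame stacking, the first item above, can be illustrated with a minimal sketch. It assumes the usual formulation: concatenate a few consecutive feature frames into one wider frame and advance by a fixed stride, which shortens the input sequence. The function name, the `n_stack`/`n_skip` parameters, and the tail-padding policy are hypothetical choices made for this example and are not taken from the toolkit's code.

```python
def stack_frames(frames, n_stack=3, n_skip=3):
    """Concatenate n_stack consecutive frames, advancing n_skip frames per output."""
    stacked = []
    for start in range(0, len(frames), n_skip):
        window = frames[start:start + n_stack]
        # Assumption: pad the tail by repeating the last frame so widths stay equal.
        while len(window) < n_stack:
            window = window + [window[-1]]
        # Flatten the window into a single wide feature vector.
        stacked.append([x for frame in window for x in frame])
    return stacked


feats = [[float(t), float(t) + 0.5] for t in range(7)]  # 7 frames, 2 dims each
out = stack_frames(feats, n_stack=3, n_skip=3)
print(len(out), len(out[0]))  # number of stacked frames, stacked dimension
```

With 7 input frames of dimension 2, this produces 3 frames of dimension 6, so the sequence seen by the encoder is roughly three times shorter.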

Encoder

RNN encoder
- (CNN-)BLSTM, (CNN-)LSTM, (CNN-)BGRU, (CNN-)GRU
- Latency-controlled BRNN [link]
- Random state passing (RSP) [link]
Transformer encoder [link]
- Chunk hopping mechanism [link]
- Relative positional encoding [link]
- Causal mask
Conformer encoder [link]
Time-depth separable (TDS) convolution encoder [link] [link]
Gated CNN encoder (GLU) [link]
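
The causal mask listed under the Transformer encoder can be sketched as follows. This assumes the standard definition, in which position i may attend only to positions j <= i, so the encoder never looks at future frames. The function name is hypothetical, and the toolkit itself presumably builds the mask as a tensor rather than nested lists.

```python
def causal_mask(length):
    """Return a length x length matrix; mask[i][j] is True if i may attend to j."""
    return [[j <= i for j in range(length)] for i in range(length)]


mask = causal_mask(4)
for row in mask:
    # Print 1 for allowed positions and 0 for masked (future) positions.
    print("".join("1" if allowed else "0" for allowed in row))
```

The result is lower triangular: the first frame sees only itself, and the last frame sees the whole prefix.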

Connectionist Temporal Classification (CTC) decoder

Beam search
Shallow fusion
Forced alignment
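
Shallow fusion appears under each decoder type. In its usual form it adds a weighted language model log-probability to the ASR log-probability when ranking candidates during beam search. The sketch below shows that score combination for a single decoding step. The function name, the `lm_weight` parameter, and the toy probabilities are illustrative assumptions, not the toolkit's API or data.

```python
import math


def shallow_fusion(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Rank next-token candidates by log p_asr + lm_weight * log p_lm."""
    fused = {
        tok: asr_log_probs[tok] + lm_weight * lm_log_probs[tok]
        for tok in asr_log_probs
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Toy distributions over three candidate tokens (made up for illustration).
asr = {"a": math.log(0.5), "b": math.log(0.4), "c": math.log(0.1)}
lm = {"a": math.log(0.1), "b": math.log(0.8), "c": math.log(0.1)}

ranked = shallow_fusion(asr, lm, lm_weight=0.5)
print([tok for tok, _ in ranked])
```

In this toy case the acoustic model alone prefers "a", but the language model evidence moves "b" to the top once the two scores are fused.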

RNN-Transducer (RNN-T) decoder [link]

Beam search
Shallow fusion

Attention-based decoder

RNN decoder
- Shallow fusion
- Cold fusion [link]
- Deep fusion [link]
- Forward-backward attention decoding [link]
- Ensemble decoding
Attention type
- location-based
- content-based
- dot-product
- GMM attention
Streaming RNN decoder specific
- Hard monotonic attention [link]
- Monotonic chunkwise attention (MoChA) [link]
- Delay constrained training (DeCoT) [link]
- Minimum latency training (MinLT) [link]
- CTC-synchronous training (CTC-ST) [link]
Transformer decoder [link]
Streaming Transformer decoder specific
- Monotonic Multihead Attention [link] [link]

Language model (LM)

RNNLM (recurrent neural network language model)
Gated convolutional LM [link]
Transformer LM
Transformer-XL LM [link]
Adaptive softmax [link]

Output units

Phoneme
Grapheme
Wordpiece (BPE, sentencepiece)
Word
Word-char mix

Multi-task learning (MTL)

Multi-task learning (MTL) with different output units is supported to alleviate data sparseness.

Hybrid CTC/attention [link]
Hierarchical Attention (e.g., word attention + character attention) [link]
Hierarchical CTC (e.g., word CTC + character CTC) [link]
Hierarchical CTC+Attention (e.g., word attention + character CTC) [link]
Forward-backward attention [link]
LM objective
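
For the hybrid CTC/attention entry above, the training objective is commonly written as an interpolation of the two losses. The symbol \lambda is notation chosen here for illustration; this README does not state how the toolkit names or sets the weight.

```latex
\mathcal{L}_{\mathrm{MTL}} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}}, \qquad 0 \le \lambda \le 1
```

Setting \lambda = 0 recovers a pure attention model and \lambda = 1 a pure CTC model.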

ASR Performance

AISHELL-1 (CER)

Model	dev	test
Transformer	5.0	5.4
Conformer	4.7	5.2
Streaming MMA	5.5	6.1

CSJ (WER)

Model	eval1	eval2	eval3
BLSTM LAS	6.5	5.1	5.6
LC-BLSTM MoChA	7.4	5.6	6.4

Switchboard 300h (WER)

Model	SWB	CH
BLSTM LAS	9.1	18.8

Switchboard+Fisher 2000h (WER)

Model	SWB	CH
BLSTM LAS	7.8	13.8

Librispeech (WER)

Model	dev-clean	dev-other	test-clean	test-other
BLSTM LAS	2.5	7.2	2.6	7.5
BLSTM RNN-T	2.9	8.5	3.2	9.0
Transformer	2.1	5.3	2.4	5.7
UniLSTM RNN-T	3.7	11.7	4.0	11.6
UniLSTM MoChA	4.1	11.0	4.2	11.2
LC-BLSTM RNN-T	3.3	9.8	3.5	10.2
LC-BLSTM MoChA	3.3	8.8	3.5	9.1
Streaming MMA	2.5	6.9	2.7	7.1

TEDLIUM2 (WER)

Model	dev	test
BLSTM LAS	8.1	7.5
LC-BLSTM RNN-T	8.9	8.5
LC-BLSTM MoChA	10.6	8.6
UniLSTM RNN-T	11.6	11.7
UniLSTM MoChA	13.6	11.6

WSJ (WER)

Model	test_dev93	test_eval92
BLSTM LAS	8.8	6.2
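
The WER and CER figures above are edit-distance metrics. As a reminder of how such a number is usually obtained, here is a minimal sketch, assuming the conventional definition: substitutions, insertions, and deletions divided by the reference length, over words for WER and over characters for CER. The function names and sentences are made up, and the toolkit's own scoring pipeline is not shown in this README.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (all edits cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
wer = error_rate(ref, hyp)                  # word-level tokens
cer = error_rate(list("abc"), list("abd"))  # character-level tokens
print(round(wer, 2), round(cer, 2))
```

Corpus-level scores are normally computed by summing the edit counts over all utterances before dividing by the total reference length, rather than by averaging per-utterance rates.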

LM Performance

Penn Tree Bank (PPL)

Model	valid	test
RNNLM	87.99	86.06
+ cache=100	79.58	79.12
+ cache=500	77.36	76.94

WikiText2 (PPL)

Model	valid	test
RNNLM	104.53	98.73
+ cache=100	90.86	85.87
+ cache=2000	76.10	72.77
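
Perplexity (PPL), the metric in the two tables above, is conventionally the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log probabilities and a hypothetical function name; it shows only the metric, not the cache mechanism or the toolkit's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/50 to every token has perplexity 50.
uniform = [math.log(1 / 50)] * 10
print(round(perplexity(uniform), 6))
```

Lower is better: a perplexity of 50 means the model is, on average, as uncertain as a uniform choice among 50 tokens.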

hajime9652/neural_sp
