NeuralSP: Neural network based Speech Processing

How to install

cd tools
make KALDI=/path/to/kaldi TOOL=/path/to/save/tools

Key features

Corpus

ASR
- AISHELL-1
- AISHELL-2
- AMI
- CSJ
- LaboroTVSpeech
- Librispeech
- Switchboard (+Fisher)
- TEDLIUM2/TEDLIUM3
- TIMIT
- WSJ
LM
- Penn Tree Bank
- WikiText2

Front-end

Frame stacking
Sequence summary network [link]
SpecAugment [link]
Adaptive SpecAugment [link]
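
As a rough illustration of the frame stacking listed above, the sketch below concatenates consecutive feature frames and subsamples the sequence. The function name and the `n_stack` / `n_skip` parameters are hypothetical and are not taken from the NeuralSP code; repeating the last frame as padding is also an assumption.

```python
def stack_frames(feats, n_stack=3, n_skip=3):
    """Concatenate n_stack consecutive frames, advancing n_skip frames per step.

    feats: list of T frames, each a list of D floats.
    Returns ceil(T / n_skip) frames of dimension D * n_stack.
    Assumption: windows that run past the end are padded with the last frame.
    """
    if not feats:
        return []
    stacked = []
    for start in range(0, len(feats), n_skip):
        window = []
        for offset in range(n_stack):
            idx = min(start + offset, len(feats) - 1)  # repeat the last frame as padding
            window.extend(feats[idx])
        stacked.append(window)
    return stacked
```

With `n_stack=3` and `n_skip=3`, the encoder sees one third as many frames, each three times as wide.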

Encoder

RNN encoder
- (CNN-)BLSTM, (CNN-)LSTM, (CNN-)BGRU, (CNN-)GRU
- Latency-controlled BRNN [link]
- Random state passing (RSP) [link]
Transformer encoder [link]
- Chunk hopping mechanism [link]
- Relative positional encoding [link]
- Causal mask
Conformer encoder [link]
Time-depth separable (TDS) convolution encoder [link] [link]
Gated CNN encoder (GLU) [link]
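
The causal mask listed under the Transformer encoder restricts each frame to attend only to itself and earlier frames. A minimal pure-Python sketch follows; the function names are hypothetical and this is not the NeuralSP implementation, which would operate on tensors.

```python
import math


def causal_mask(length):
    """Return a length x length matrix where mask[i][j] is True iff j <= i."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, mask):
    """Row-wise softmax over scores; positions where mask is False get weight 0."""
    weights = []
    for row, mask_row in zip(scores, mask):
        # Subtract the largest unmasked score for numerical stability.
        top = max(s for s, keep in zip(row, mask_row) if keep)
        exps = [math.exp(s - top) if keep else 0.0 for s, keep in zip(row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Because the diagonal is always unmasked, every row has at least one valid position and the normalization never divides by zero.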

Connectionist Temporal Classification (CTC) decoder

Beam search
Shallow fusion
Forced alignment
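
For orientation, CTC hypotheses are obtained from frame-level label paths by merging repeated labels and then removing blanks. The sketch below applies that rule to the single best path (greedy decoding), which is simpler than the beam search listed above; the function name and the choice of index 0 for the blank label are assumptions.

```python
def ctc_greedy_decode(log_probs, blank=0):
    """Greedy (best-path) CTC decoding.

    log_probs: list of T frames, each a list of V per-label log-probabilities.
    Returns the label sequence after merging repeats and dropping blanks.
    """
    # Pick the most likely label independently at each frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in log_probs]
    hyp = []
    prev = None
    for label in best_path:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != prev and label != blank:
            hyp.append(label)
        prev = label
    return hyp
```

A blank between two identical labels keeps them distinct, so the path `1 1 0 1` decodes to `1 1` rather than `1`.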

RNN-Transducer (RNN-T) decoder [link]

Beam search
Shallow fusion
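
Shallow fusion combines the ASR model's score with an external LM score at decoding time by log-linear interpolation. A minimal sketch of one beam expansion step, assuming both models expose per-token log-probabilities over the same vocabulary; `lm_weight`, `beam_width`, and the function name are hypothetical.

```python
def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.3, beam_width=2):
    """Rank next-token candidates by asr_log_prob + lm_weight * lm_log_prob.

    Both inputs are lists indexed by token id over a shared vocabulary.
    Returns the top beam_width (token_id, fused_score) pairs, best first.
    """
    fused = [a + lm_weight * l for a, l in zip(asr_log_probs, lm_log_probs)]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return [(i, fused[i]) for i in ranked[:beam_width]]
```

With `lm_weight=0.0` the ranking reduces to the ASR model's own scores; larger weights let the LM override acoustically similar candidates.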

Attention-based decoder

RNN decoder
- Shallow fusion
- Cold fusion [link]
- Deep fusion [link]
- Forward-backward attention decoding [link]
- Ensemble decoding
- Internal LM estimation [link]
Attention type
- Location-based
- Content-based
- Dot-product
- GMM attention
Streaming RNN decoder specific
- Hard monotonic attention [link]
- Monotonic chunkwise attention (MoChA) [link]
- Delay constrained training (DeCoT) [link]
- Minimum latency training (MinLT) [link]
- CTC-synchronous training (CTC-ST) [link]
Transformer decoder [link]
Streaming Transformer decoder specific
- Monotonic Multihead Attention [link] [link]

Language model (LM)

RNNLM (recurrent neural network language model)
Gated convolutional LM [link]
Transformer LM
Transformer-XL LM [link]
Adaptive softmax [link]

Output units

Phoneme
Grapheme
Wordpiece (BPE, sentencepiece)
Word
Word-char mix

Multi-task learning (MTL)

Multi-task learning (MTL) with different output units is supported to alleviate data sparseness.

Hybrid CTC/attention [link]
Hierarchical Attention (e.g., word attention + character attention) [link]
Hierarchical CTC (e.g., word CTC + character CTC) [link]
Hierarchical CTC+Attention (e.g., word attention + character CTC) [link]
Forward-backward attention [link]
LM objective
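
In hybrid CTC/attention training, the two objectives are usually interpolated with a single weight. The sketch below shows that combination on scalar loss values; `ctc_weight` and the function name are hypothetical, and the actual option names in NeuralSP may differ.

```python
def hybrid_ctc_attention_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Interpolate the CTC and attention losses: w * L_ctc + (1 - w) * L_att."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
```

Setting `ctc_weight` to 0 or 1 recovers pure attention or pure CTC training. The hierarchical variants above would presumably add further weighted terms for the auxiliary output units.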

ASR Performance

AISHELL-1 (CER)

Model	dev	test
Conformer LAS	4.1	4.5
Transformer	5.0	5.4
Streaming MMA	5.5	6.1

AISHELL-2 (CER)

Model	test_android	test_ios	test_mic
Conformer LAS	6.1	5.5	5.9

CSJ (WER)

Model	eval1	eval2	eval3
Conformer LAS	5.7	4.4	4.9
BLSTM LAS	6.5	5.1	5.6
LC-BLSTM MoChA	7.4	5.6	6.4

Switchboard 300h (WER)

Model	SWB	CH
BLSTM LAS	9.1	18.8

Switchboard+Fisher 2000h (WER)

Model	SWB	CH
BLSTM LAS	7.8	13.8

LaboroTVSpeech (CER)

Model	dev_4k	dev	tedx-jp-10k
Conformer LAS	7.8	10.1	12.4

Librispeech (WER)

Model	dev-clean	dev-other	test-clean	test-other
Conformer LAS	1.9	4.6	2.1	4.9
Transformer	2.1	5.3	2.4	5.7
BLSTM LAS	2.5	7.2	2.6	7.5
BLSTM RNN-T	2.9	8.5	3.2	9.0
UniLSTM RNN-T	3.7	11.7	4.0	11.6
UniLSTM MoChA	4.1	11.0	4.2	11.2
LC-BLSTM RNN-T	3.3	9.8	3.5	10.2
LC-BLSTM MoChA	3.3	8.8	3.5	9.1
Streaming MMA	2.5	6.9	2.7	7.1

TEDLIUM2 (WER)

Model	dev	test
Conformer LAS	7.0	6.8
BLSTM LAS	8.1	7.5
LC-BLSTM RNN-T	8.0	7.7
LC-BLSTM MoChA	10.3	8.6
UniLSTM RNN-T	10.7	10.7
UniLSTM MoChA	13.5	11.6

WSJ (WER)

Model	test_dev93	test_eval92
BLSTM LAS	8.8	6.2
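
The WER and CER figures in the tables above are edit-distance based: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch, assuming no text normalization and not reproducing the scoring scripts actually used for these results; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent; pass words for WER or characters for CER."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


wer = error_rate("the cat sat on the mat".split(), "the cat sit on mat".split())
cer = error_rate(list("abc"), list("abd"))
```

The same routine gives WER or CER depending only on how the strings are tokenized.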

LM Performance

Penn Tree Bank (PPL)

Model	valid	test
RNNLM	87.99	86.06
+ cache=100	79.58	79.12
+ cache=500	77.36	76.94

WikiText2 (PPL)

Model	valid	test
RNNLM	104.53	98.73
+ cache=100	90.86	85.87
+ cache=2000	76.10	72.77
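
Perplexity (PPL) in the tables above is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming natural-log token probabilities from the LM; the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """PPL = exp(-(1/N) * sum of log p(w_i | history)), with natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

For example, a model that assigns probability 0.25 to every token has a perplexity of 4.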

