NeuralSP: Neural network based Speech Processing

How to install

# Set path to CUDA, NCCL
CUDAROOT=/usr/local/cuda
NCCL_ROOT=/usr/local/nccl

export CPATH=$NCCL_ROOT/include:$CPATH
export LD_LIBRARY_PATH=$NCCL_ROOT/lib/:$CUDAROOT/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=$NCCL_ROOT/lib/:$LIBRARY_PATH
export CUDA_HOME=$CUDAROOT
export CUDA_PATH=$CUDAROOT
export CPATH=$CUDA_PATH/include:$CPATH  # for warp-rnnt

# Install miniconda, python libraries, and other tools
cd tools
make KALDI=/path/to/kaldi

Key features

Corpus

ASR
- AISHELL-1
- CSJ
- Librispeech
- Switchboard (+ Fisher)
- TEDLIUM2/TEDLIUM3
- TIMIT
- WSJ
LM
- Penn Tree Bank
- WikiText2

Front-end

Frame stacking
Sequence summary network [link]
SpecAugment [link]
Adaptive SpecAugment [link]
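
As a rough illustration of two of the front-end options above, the sketch below stacks frames and applies SpecAugment-style masking in plain Python. It is not the toolkit's implementation: the function names `stack_frames` and `mask_spectrogram`, their parameters, and the list-of-lists feature layout (frames x frequency bins) are assumptions made for this example, and time warping is left out.

```python
import random


def stack_frames(feats, n_stack=3, n_skip=3):
    """Concatenate n_stack consecutive frames, advancing n_skip frames per output frame."""
    stacked = []
    for start in range(0, len(feats) - n_stack + 1, n_skip):
        window = feats[start:start + n_stack]
        stacked.append([value for frame in window for value in frame])
    return stacked


def mask_spectrogram(feats, max_freq_width=2, max_time_width=2, seed=None):
    """Zero one random frequency band and one random time span (masking only)."""
    rng = random.Random(seed)
    n_frames, n_bins = len(feats), len(feats[0])
    out = [list(frame) for frame in feats]  # copy so the input stays untouched

    # Sample the mask widths first, then start positions that keep the masks in range.
    f_width = rng.randint(0, min(max_freq_width, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    t_width = rng.randint(0, min(max_time_width, n_frames))
    t_start = rng.randint(0, n_frames - t_width)

    for t in range(n_frames):
        for f in range(n_bins):
            in_freq_mask = f_start <= f < f_start + f_width
            in_time_mask = t_start <= t < t_start + t_width
            if in_freq_mask or in_time_mask:
                out[t][f] = 0.0
    return out


# Toy input: 6 frames with 4 frequency bins each.
feats = [[float(t * 10 + f) for f in range(4)] for t in range(6)]
stacked = stack_frames(feats, n_stack=3, n_skip=3)
masked = mask_spectrogram(feats, seed=0)
```

Stacking with `n_stack == n_skip` shortens the sequence by that factor while widening each frame, which is the usual reason to apply it before a recurrent encoder.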

Encoder

RNN encoder
- (CNN-)BLSTM, (CNN-)LSTM, (CNN-)BGRU, (CNN-)GRU
- Latency-controlled BLSTM [link]
Transformer encoder [link]
- (CNN-)Transformer
- Chunk hopping mechanism [link]
- Relative positional encoding [link]
Time-depth separable (TDS) convolution encoder [link] [link]
Gated CNN encoder (GLU) [link]
Conformer encoder [link]

Connectionist Temporal Classification (CTC) decoder

Forced alignment
Beam search
Shallow fusion
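
Shallow fusion is commonly described as adding a weighted LM log-probability to the ASR log-probability at each decoding step. The sketch below shows that scoring rule for a single step. The names `fused_scores` and `best_token`, the `lm_weight` value, and the toy probabilities are assumptions for illustration, not the toolkit's API or defaults.

```python
import math


def fused_scores(asr_logprobs, lm_logprobs, lm_weight=0.3):
    """Return log p_asr(y) + lm_weight * log p_lm(y) for every candidate token y."""
    return {
        token: asr_lp + lm_weight * lm_logprobs[token]
        for token, asr_lp in asr_logprobs.items()
    }


def best_token(scores):
    """Pick the candidate with the highest fused score."""
    return max(scores, key=scores.get)


# Toy distributions over three acoustically confusable candidates (made-up numbers).
asr = {"two": math.log(0.50), "too": math.log(0.45), "to": math.log(0.05)}
lm = {"two": math.log(0.10), "too": math.log(0.60), "to": math.log(0.30)}

without_lm = best_token(fused_scores(asr, lm, lm_weight=0.0))
with_lm = best_token(fused_scores(asr, lm, lm_weight=0.3))
```

In this toy case the acoustic model alone prefers "two", while the fused score prefers "too" because the LM assigns it a much higher probability. In a beam search the same fused score would be accumulated per hypothesis rather than used for a single greedy choice.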

Attention-based decoder

RNN decoder
- Shallow fusion
- Cold fusion [link]
- Deep fusion [link]
- Forward-backward attention decoding [link]
- Ensemble decoding
Streaming RNN decoder
- Hard monotonic attention [link]
- Monotonic chunkwise attention (MoChA) [link]
- CTC-synchronous training (CTC-ST) [link]
RNN transducer [link]
Transformer decoder [link]
Streaming Transformer decoder
- Monotonic Multihead Attention [link] [link]

Language model (LM)

RNNLM (recurrent neural network language model)
Gated convolutional LM [link]
Transformer LM
Transformer-XL LM [link]
Adaptive softmax [link]

Output units

Phoneme
Grapheme
Wordpiece (BPE, sentencepiece)
Word
Word-char mix

Multi-task learning (MTL)

Multi-task learning (MTL) with different output units is supported to alleviate data sparseness.

Hybrid CTC/attention [link]
Hierarchical Attention (e.g., word attention + character attention) [link]
Hierarchical CTC (e.g., word CTC + character CTC) [link]
Hierarchical CTC+Attention (e.g., word attention + character CTC) [link]
Forward-backward attention [link]
LM objective
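
The objectives listed above are typically combined as a weighted sum of per-task losses. For hybrid CTC/attention, this is the familiar interpolation `ctc_weight * L_ctc + (1 - ctc_weight) * L_att`. The sketch below shows that combination on plain floats. The function names and the `ctc_weight=0.3` default are assumptions for illustration, not values taken from the toolkit.

```python
def multitask_loss(losses, weights):
    """Weighted sum of per-task losses; both dicts must use the same task names."""
    if set(losses) != set(weights):
        raise ValueError("losses and weights must cover the same tasks")
    return sum(weights[name] * losses[name] for name in losses)


def hybrid_ctc_attention_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Interpolate the CTC and attention losses with a single weight in [0, 1]."""
    return multitask_loss(
        {"ctc": ctc_loss, "att": att_loss},
        {"ctc": ctc_weight, "att": 1.0 - ctc_weight},
    )


# Toy loss values: 0.3 * 2.0 + 0.7 * 1.0
total = hybrid_ctc_attention_loss(ctc_loss=2.0, att_loss=1.0, ctc_weight=0.3)
```

The same `multitask_loss` helper extends to the hierarchical setups by adding one entry per auxiliary unit (for example, a word-level loss and a character-level loss).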

ASR Performance

AISHELL-1 (CER)

model	dev	test
Transformer	5.0	5.4
Conformer	4.7	5.2
Streaming MMA	5.5	6.1

CSJ (WER)

model	eval1	eval2	eval3
LAS	6.5	5.1	5.6

Switchboard 300h (WER)

model	SWB	CH
LAS	9.1	18.8

Switchboard+Fisher 2000h (WER)

model	SWB	CH
LAS	7.8	13.8

Librispeech (WER)

model	dev-clean	dev-other	test-clean	test-other
Transformer	2.1	5.3	2.4	5.7
Streaming MMA	2.5	6.9	2.7	7.1

TEDLIUM2 (WER)

model	dev	test
LAS	10.9	11.2

WSJ (WER)

model	test_dev93	test_eval92
LAS	8.8	6.2
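
The tables above report word error rate (WER) and character error rate (CER). Both are conventionally computed as the Levenshtein distance between reference and hypothesis, divided by the reference length and expressed as a percentage. The sketch below is a minimal version of that calculation. The function names are hypothetical, and the text normalization and scoring tools behind the reported numbers are not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, 1):
        cur = [i]
        for j, hyp_item in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                            # deletion
                cur[j - 1] + 1,                         # insertion
                prev[j - 1] + (ref_item != hyp_item),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent; unit is 'word' (WER) or 'char' (CER)."""
    total_errors = 0
    total_length = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            ref_units, hyp_units = ref.split(), hyp.split()
        else:
            # Characters with spaces removed, a common convention for CER.
            ref_units, hyp_units = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
        total_errors += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return 100.0 * total_errors / total_length


wer = error_rate(["the cat sat"], ["the cat sat down"], unit="word")
cer = error_rate(["abcd"], ["abcf"], unit="char")
```

Errors and reference lengths are summed over the whole test set before dividing, so long utterances weigh more than short ones.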

LM Performance

Penn Tree Bank (PPL)

model	valid	test
RNNLM	87.99	86.06
+ cache=100	79.58	79.12
+ cache=500	77.36	76.94

WikiText2 (PPL)

model	valid	test
RNNLM	104.53	98.73
+ cache=100	90.86	85.87
+ cache=2000	76.10	72.77
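
Perplexity (PPL), as reported in the tables above, is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities. The function name and the toy inputs are assumptions for illustration.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) over all tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as a uniform choice among 4 tokens.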
