ESPnet: end-to-end speech processing toolkit

Tested PyTorch versions: 1.0.1, 1.1.0, 1.2.0, 1.3.1, 1.4.0, 1.5.1, 1.6.0
Tested systems: ubuntu18/python3.8/pip, ubuntu18/python3.7/pip, ubuntu18/python3.6/conda, ubuntu20/python3.6/conda, debian9/python3.6/conda, centos7/python3.6/conda
Docs/coverage: python3.8

ESPnet is an end-to-end speech processing toolkit that mainly focuses on end-to-end speech recognition and end-to-end text-to-speech. ESPnet uses Chainer and PyTorch as its main deep learning engines and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

Key Features

Kaldi style complete recipe

Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)
Supports a number of TTS recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
Supports a number of MT recipes (IWSLT'16, the above ST recipes, etc.)
Supports a speech separation and recognition recipe (WSJ-2mix)
Supports a voice conversion recipe (VCC2020 baseline) (new!)

ASR: Automatic Speech Recognition

State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
Hybrid CTC/attention based end-to-end ASR
- Fast/accurate training with CTC/attention multitask training
- CTC/attention joint decoding to boost monotonic alignment decoding
- Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU) or Transformer
Attention: Dot product, location-aware attention, variants of multihead
Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
Batch GPU decoding
Transducer based end-to-end ASR
- Available: RNN-Transducer, Transformer-Transducer, mixed Transformer/RNN-Transducer
- Also support: attention mechanism (RNN-decoder), pre-init w/ LM (RNN-decoder), VGG-Transformer (encoder)
CTC segmentation

TTS: Text-to-speech

Tacotron2
Transformer-TTS
FastSpeech
FastSpeech2 (in ESPnet2)
Conformer-based FastSpeech & FastSpeech2 (in ESPnet2)
Multi-speaker model with pretrained speaker embedding
Multi-speaker model with GST (in ESPnet2)
Phoneme-based training (En, Jp, and Zh)
Integration with neural vocoders (WaveNet, ParallelWaveGAN, and MelGAN)

You can try the demo online now!

Real-time TTS demo with ESPnet2
Real-time TTS demo with ESPnet1

To train the neural vocoder, please check the vocoder repositories listed in the TTS results section below.

NOTE:

We are moving to ESPnet2-based development for TTS.

If you are a beginner, we recommend using ESPnet2-TTS.

ST: Speech Translation & MT: Machine Translation

State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
Transformer based end-to-end ST (new!)
Transformer based end-to-end MT (new!)

VC: Voice conversion

Transformer and Tacotron2 based parallel VC using melspectrogram (new!)
End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)

DNN Framework

Flexible network architecture thanks to Chainer and PyTorch
Flexible front-end processing thanks to kaldiio and HDF5 support
Tensorboard based monitoring

ESPnet2

See ESPnet2.

Independent of Kaldi/Chainer
On-the-fly feature extraction and text processing during training
Multi-GPU training on single/multiple nodes (distributed training)
A template recipe that can be applied to all corpora
Possible to train on a corpus of any size without CPU memory errors
(Under development) ESPnet Model Zoo

Installation

If you intend to do full experiments including DNN training, then see Installation.

If you need only the Python module:

pip install espnet
# To install latest
# pip install git+https://github.com/espnet/espnet

You also need to install some additional packages.

pip install torch
pip install chainer==6.0.0 cupy==6.0.0    # [Optional] If you use ESPnet1
pip install torchaudio                    # [Optional] If you use the enhancement task
pip install torch_optimizer               # [Optional] If you use additional optimizers in ESPnet2

Some tasks require additional packages beyond those listed above. If you encounter an ImportError, please install the missing package at that time.
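To see up front which of the packages named above are importable, a small check like the following may help. It is only a sketch: the module list is taken from the pip commands above, and `find_missing` is a hypothetical helper name, not part of ESPnet.

```python
import importlib.util

# Import names taken from the pip commands above (assumed to match the
# package names one-to-one).
OPTIONAL_MODULES = ["torch", "chainer", "cupy", "torchaudio", "torch_optimizer"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(OPTIONAL_MODULES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Only the packages your task actually needs have to be installed; a name reported here is not necessarily a problem.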

Usage

See Usage.

Docker Container

Go to docker/ and follow the instructions.

Contribution

Thank you for taking the time to contribute to ESPnet! Any contributions to ESPnet are welcome, and feel free to post questions or requests in the issues. If this is your first contribution to ESPnet, please follow the contribution guide.

Results and demo

You can find useful tutorials and demos in the Interspeech 2019 Tutorial.

ASR results

We list the character error rate (CER) and word error rate (WER) of major ASR tasks.

Task	CER (%)	WER (%)	Pretrained model
Aishell dev	6.0	N/A	link
Aishell test	6.6	N/A	same as above
Common Voice dev	1.7	2.2	link
Common Voice test	1.8	2.3	same as above
CSJ eval1	5.7	N/A	link
CSJ eval2	3.8	N/A	same as above
CSJ eval3	4.2	N/A	same as above
HKUST dev	23.5	N/A	link
Librispeech dev_clean	N/A	2.1	link
Librispeech dev_other	N/A	5.3	same as above
Librispeech test_clean	N/A	2.5	same as above
Librispeech test_other	N/A	5.5	same as above
TEDLIUM2 dev	N/A	9.3	link
TEDLIUM2 test	N/A	8.1	same as above
TEDLIUM3 dev	N/A	9.7	link
TEDLIUM3 test	N/A	8.0	same as above
WSJ dev93	3.2	7.0	N/A
WSJ eval92	2.1	4.7	N/A

Note that the performance on the CSJ, HKUST, and Librispeech tasks was significantly improved by using a wide network (#units = 1024) and, where necessary, large subword units, as reported by RWTH.

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/asr1/RESULTS.md.
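For readers unfamiliar with the metrics in the table above, the sketch below shows the usual edit-distance definition of CER and WER. It is not the scoring pipeline used by the recipes; the function names are hypothetical, and removing spaces before computing CER is an assumption made here for simplicity.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_token == hyp_token else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent: edit distance divided by reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate over whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate; spaces are dropped (an assumption of this sketch)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


if __name__ == "__main__":
    print(wer("THE SALE OF THE HOTELS", "THE SALE OF HOTELS"))
```

Because insertions count as errors, both rates can exceed 100% when the hypothesis is much longer than the reference.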

ASR demo

You can recognize speech in a WAV file using pretrained models. Go to a recipe directory and run utils/recog_wav.sh as follows:

# go to recipe directory and source path of espnet tools
cd egs/tedlium2/asr1 && . ./path.sh
# let's recognize speech!
recog_wav.sh --models tedlium2.transformer.v1 example.wav

where example.wav is a WAV file to be recognized. The sampling rate must be consistent with that of the data used in training.
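One way to check the sampling rate before running the demo is to read the WAV header with Python's standard `wave` module, as sketched below. The expected rate of 16000 Hz is only an assumed placeholder; use the rate of the training data for the model you chose. The helper names are hypothetical.

```python
import wave


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def check_sample_rate(path, expected_hz):
    """Return (matches, actual_hz) for the given file and expected rate."""
    actual_hz = wav_sample_rate(path)
    return actual_hz == expected_hz, actual_hz


if __name__ == "__main__":
    import sys

    # 16000 is an assumed placeholder, not a value taken from any recipe.
    ok, actual = check_sample_rate(sys.argv[1], expected_hz=16000)
    print("OK" if ok else f"Mismatch: file is {actual} Hz")
```

If the rates differ, resample the file with a tool such as sox before decoding.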

The pretrained models available in the demo script are listed below.

Model	Notes
tedlium2.rnn.v1	Streaming decoding based on CTC-based VAD
tedlium2.rnn.v2	Streaming decoding based on CTC-based VAD (batch decoding)
tedlium2.transformer.v1	Joint-CTC attention Transformer trained on Tedlium 2
tedlium3.transformer.v1	Joint-CTC attention Transformer trained on Tedlium 3
librispeech.transformer.v1	Joint-CTC attention Transformer trained on Librispeech
commonvoice.transformer.v1	Joint-CTC attention Transformer trained on CommonVoice
csj.transformer.v1	Joint-CTC attention Transformer trained on CSJ
csj.rnn.v1	Joint-CTC attention VGGBLSTM trained on CSJ

ST results

We list 4-gram BLEU of major ST tasks.

end-to-end system

Task	BLEU	Pretrained model
Fisher-CallHome Spanish fisher_test (Es->En)	48.39	link
Fisher-CallHome Spanish callhome_evltest (Es->En)	18.67	link
Libri-trans test (En->Fr)	16.70	link
How2 dev5 (En->Pt)	45.68	link
Must-C tst-COMMON (En->De)	22.91	link
Mboshi-French dev (Fr->Mboshi)	6.18	N/A

cascaded system

Task	BLEU	Pretrained model
Fisher-CallHome Spanish fisher_test (Es->En)	42.16	N/A
Fisher-CallHome Spanish callhome_evltest (Es->En)	19.82	N/A
Libri-trans test (En->Fr)	16.96	N/A
How2 dev5 (En->Pt)	44.90	N/A
Must-C tst-COMMON (En->De)	23.65	N/A

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/st1/RESULTS.md.

ST demo

(New!) We made a new real-time E2E-ST + TTS demonstration in Google Colab. Please access the notebook from the following button and enjoy the real-time speech-to-speech translation!

You can translate speech in a WAV file using pretrained models. Go to a recipe directory and run utils/translate_wav.sh as follows:

# go to recipe directory and source path of espnet tools
cd egs/fisher_callhome_spanish/st1 && . ./path.sh
# download example wav file
wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
# let's translate speech!
translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav

where test.wav is a WAV file to be translated. The sampling rate must be consistent with that of the data used in training.

The pretrained models available in the demo script are listed below.

Model	Notes
fisher_callhome_spanish.transformer.v1	Transformer-ST trained on Fisher-CallHome Spanish Es->En

MT results

Task	BLEU	Pretrained model
Fisher-CallHome Spanish fisher_test (Es->En)	61.45	link
Fisher-CallHome Spanish callhome_evltest (Es->En)	29.86	link
Libri-trans test (En->Fr)	18.09	link
How2 dev5 (En->Pt)	58.61	link
Must-C tst-COMMON (En->De)	27.63	link
IWSLT'14 test2014 (En->De)	24.70	link
IWSLT'14 test2014 (De->En)	29.22	link
IWSLT'16 test2014 (En->De)	24.05	link
IWSLT'16 test2014 (De->En)	29.13	link

TTS results

You can listen to our samples on the demo page espnet-tts-sample. Here we list some notable ones:

You can download all of the pretrained models and generated samples:

Note that in the generated samples we use the following vocoders: Griffin-Lim (GL), WaveNet vocoder (WaveNet), Parallel WaveGAN (ParallelWaveGAN), and MelGAN (MelGAN). The neural vocoders are based on the following repositories.

kan-bayashi/ParallelWaveGAN: Parallel WaveGAN / MelGAN / Multi-band MelGAN
r9y9/wavenet_vocoder: 16-bit mixture-of-logistics WaveNet vocoder
kan-bayashi/PytorchWaveNetVocoder: 8-bit softmax WaveNet vocoder with noise shaping

If you want to build your own neural vocoder, please check the above repositories. kan-bayashi/ParallelWaveGAN provides a manual on how to decode an ESPnet-TTS model's features with neural vocoders.

Here we list all of the pretrained neural vocoders. Please download and enjoy the generation of high quality speech!

Model link	Lang	Fs [Hz]	Mel range [Hz]	FFT / Shift / Win [pt]	Model type
ljspeech.wavenet.softmax.ns.v1	EN	22.05k	None	1024 / 256 / None	Softmax WaveNet
ljspeech.wavenet.mol.v1	EN	22.05k	None	1024 / 256 / None	MoL WaveNet
ljspeech.parallel_wavegan.v1	EN	22.05k	None	1024 / 256 / None	Parallel WaveGAN
ljspeech.wavenet.mol.v2	EN	22.05k	80-7600	1024 / 256 / None	MoL WaveNet
ljspeech.parallel_wavegan.v2	EN	22.05k	80-7600	1024 / 256 / None	Parallel WaveGAN
ljspeech.melgan.v1 (EXPERIMENTAL)	EN	22.05k	80-7600	1024 / 256 / None	MelGAN
ljspeech.melgan.v3 (EXPERIMENTAL)	EN	22.05k	80-7600	1024 / 256 / None	MelGAN
libritts.wavenet.mol.v1	EN	24k	None	1024 / 256 / None	MoL WaveNet
jsut.wavenet.mol.v1	JP	24k	80-7600	2048 / 300 / 1200	MoL WaveNet
jsut.parallel_wavegan.v1	JP	24k	80-7600	2048 / 300 / 1200	Parallel WaveGAN
csmsc.wavenet.mol.v1	ZH	24k	80-7600	2048 / 300 / 1200	MoL WaveNet
csmsc.parallel_wavegan.v1	ZH	24k	80-7600	2048 / 300 / 1200	Parallel WaveGAN

If you want to use the above pretrained vocoders, make sure your feature settings exactly match theirs.
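As an illustration of what "exactly match" means, the sketch below compares a set of feature settings against two rows transcribed from the vocoder table above. The key names (`fs`, `fmin`, `fmax`, `n_fft`, `n_shift`, `win_length`) and the helper `feature_mismatches` are assumptions made for this example, not a confirmed ESPnet configuration schema.

```python
# Rows transcribed from the vocoder table above (22.05k -> 22050, 24k -> 24000).
# Key names are illustrative assumptions.
VOCODER_SETTINGS = {
    "ljspeech.parallel_wavegan.v2": {
        "fs": 22050, "fmin": 80, "fmax": 7600,
        "n_fft": 1024, "n_shift": 256, "win_length": None,
    },
    "jsut.parallel_wavegan.v1": {
        "fs": 24000, "fmin": 80, "fmax": 7600,
        "n_fft": 2048, "n_shift": 300, "win_length": 1200,
    },
}


def feature_mismatches(my_settings, vocoder_name):
    """Return {key: (mine, expected)} for every setting that differs."""
    expected = VOCODER_SETTINGS[vocoder_name]
    return {
        key: (my_settings.get(key), want)
        for key, want in expected.items()
        if my_settings.get(key) != want
    }


if __name__ == "__main__":
    mine = {"fs": 24000, "fmin": 80, "fmax": 7600,
            "n_fft": 2048, "n_shift": 256, "win_length": 1200}
    print(feature_mismatches(mine, "jsut.parallel_wavegan.v1"))
```

An empty result means every listed setting agrees with the table row for that vocoder.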

TTS demo

We made a new real-time E2E-TTS demonstration in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis!

Real-time TTS demo with ESPnet2
Real-time TTS demo with ESPnet1

You can synthesize speech in a TXT file using pretrained models. Go to a recipe directory and run utils/synth_wav.sh as follows:

# go to recipe directory and source path of espnet tools
cd egs/ljspeech/tts1 && . ./path.sh
# we use upper-case char sequence for the default model.
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
# let's synthesize speech!
synth_wav.sh example.txt

# also you can use multiple sentences
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example_multi.txt
echo "TEXT TO SPEECH IS A TECHNIQUE TO CONVERT TEXT INTO SPEECH." >> example_multi.txt
synth_wav.sh example_multi.txt

You can change the pretrained model as follows:

synth_wav.sh --models ljspeech.fastspeech.v1 example.txt

Waveform synthesis is performed with the Griffin-Lim algorithm or with neural vocoders (WaveNet and ParallelWaveGAN). You can change the pretrained vocoder model as follows:

synth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt

The WaveNet vocoder provides very high-quality speech, but generation takes time.

Important Note:

This code does not include a text frontend, so please clean the input text manually. You also need to modify the feature configuration according to the model. The default settings are for the ljspeech models, so if you want to use other pretrained models, please modify the parameters yourself. For the models we provide, the settings are listed in the table below.

If you are a beginner, we strongly recommend trying the Colab notebook first instead of this script. It covers the whole procedure: text frontend, feature generation, and waveform generation.

Available pretrained models in the demo script are listed as follows:

Model link	Lang	Fs [Hz]	Mel range [Hz]	FFT / Shift / Win [pt]	Input	R	Model type
ljspeech.tacotron2.v1	EN	22.05k	None	1024 / 256 / None	char	2	Tacotron 2
ljspeech.tacotron2.v2	EN	22.05k	None	1024 / 256 / None	char	1	Tacotron 2 + forward attention
ljspeech.tacotron2.v3	EN	22.05k	None	1024 / 256 / None	char	1	Tacotron 2 + guided attention loss
ljspeech.transformer.v1	EN	22.05k	None	1024 / 256 / None	char	1	Deep Transformer
ljspeech.transformer.v2	EN	22.05k	None	1024 / 256 / None	char	3	Shallow Transformer
ljspeech.transformer.v3	EN	22.05k	None	1024 / 256 / None	phn	1	Deep Transformer
ljspeech.fastspeech.v1	EN	22.05k	None	1024 / 256 / None	char	1	FF-Transformer
ljspeech.fastspeech.v2	EN	22.05k	None	1024 / 256 / None	char	1	FF-Transformer + CNN in FFT block
ljspeech.fastspeech.v3	EN	22.05k	None	1024 / 256 / None	phn	1	FF-Transformer + CNN in FFT block + postnet
libritts.tacotron2.v1	EN	24k	80-7600	1024 / 256 / None	char	2	Multi-speaker Tacotron 2
libritts.transformer.v1	EN	24k	80-7600	1024 / 256 / None	char	2	Multi-speaker Transformer
jsut.tacotron2	JP	24k	80-7600	2048 / 300 / 1200	phn	2	Tacotron 2
jsut.transformer	JP	24k	80-7600	2048 / 300 / 1200	phn	3	Shallow Transformer
csmsc.transformer.v1	ZH	24k	80-7600	2048 / 300 / 1200	pinyin	1	Deep Transformer
csmsc.fastspeech.v3	ZH	24k	80-7600	2048 / 300 / 1200	pinyin	1	FF-Transformer + CNN in FFT block + postnet

Available pretrained vocoder models in the demo script are listed as follows:

Model link	Lang	Fs [Hz]	Mel range [Hz]	FFT / Shift / Win [pt]	Model type
ljspeech.wavenet.softmax.ns.v1	EN	22.05k	None	1024 / 256 / None	Softmax WaveNet
ljspeech.wavenet.mol.v1	EN	22.05k	None	1024 / 256 / None	MoL WaveNet
ljspeech.parallel_wavegan.v1	EN	22.05k	None	1024 / 256 / None	Parallel WaveGAN
libritts.wavenet.mol.v1	EN	24k	None	1024 / 256 / None	MoL WaveNet
jsut.wavenet.mol.v1	JP	24k	80-7600	2048 / 300 / 1200	MoL WaveNet
jsut.parallel_wavegan.v1	JP	24k	80-7600	2048 / 300 / 1200	Parallel WaveGAN
csmsc.wavenet.mol.v1	ZH	24k	80-7600	2048 / 300 / 1200	MoL WaveNet
csmsc.parallel_wavegan.v1	ZH	24k	80-7600	2048 / 300 / 1200	Parallel WaveGAN

VC results

Transformer and Tacotron2 based VC

You can listen to some samples on the demo webpage.

Cascade ASR+TTS as one of the baseline systems of VCC2020

The Voice Conversion Challenge 2020 (VCC2020) adopts ESPnet to build an end-to-end based baseline system. In VCC2020, the objective is intra/cross lingual nonparallel VC. You can download converted samples of the cascade ASR+TTS baseline system here.

CTC Segmentation demo

CTC segmentation determines utterance segments within audio files. Aligned utterance segments constitute the "labels" of speech datasets.

As a demo, we align the start and end of utterances within the audio file ctc_align_test.wav, using the example script utils/asr_align_wav.sh. To prepare, set up a data directory:

cd egs/tedlium2/align1/
# data directory
align_dir=data/demo
mkdir -p ${align_dir}
# wav file
base=ctc_align_test
wav=../../../test_utils/${base}.wav
# recipe files
echo "batchsize: 0" > ${align_dir}/align.yaml

cat << EOF > ${align_dir}/utt_text
${base} THE SALE OF THE HOTELS
${base} IS PART OF HOLIDAY'S STRATEGY
${base} TO SELL OFF ASSETS
${base} AND CONCENTRATE
${base} ON PROPERTY MANAGEMENT
EOF

Here, utt_text is the file containing the list of utterances. Choose a pre-trained ASR model that includes a CTC layer to find utterance segments:

# pre-trained ASR model
model=wsj.transformer_small.v1
mkdir -p ./conf && cp ../../wsj/asr1/conf/no_preprocess.yaml ./conf

../../../utils/asr_align_wav.sh \
    --models ${model} \
    --align_dir ${align_dir} \
    --align_config ${align_dir}/align.yaml \
    ${wav} ${align_dir}/utt_text

Segments are written to aligned_segments as a list of file/utterance name, utterance start and end times in seconds, and a confidence score. The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:

min_confidence_score=-5
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${align_dir}/aligned_segments
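The same filtering can be sketched in Python when further processing of the segments is needed. As in the awk command above, the confidence score is read from the fifth whitespace-separated field; the function name `filter_segments` is hypothetical, and the sample lines in the usage part are made-up values, not real output.

```python
def filter_segments(lines, min_confidence_score=-5.0):
    """Keep segment lines whose fifth field (confidence score) exceeds the threshold."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            # Skip blank or malformed lines rather than failing.
            continue
        if float(fields[4]) > min_confidence_score:
            kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    # Made-up example lines for illustration only.
    sample = [
        "utt_0000 ctc_align_test 0.26 1.73 -0.0154 THE SALE OF THE HOTELS\n",
        "utt_0001 ctc_align_test 1.73 3.19 -7.5000 IS PART OF HOLIDAY'S STRATEGY\n",
    ]
    for segment in filter_segments(sample):
        print(segment)
```

To filter a real run, pass the opened aligned_segments file (an iterable of lines) instead of the sample list.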

The demo script utils/asr_align_wav.sh uses a pretrained ASR model (see the list above for more models). The sample rate of the audio must be consistent with that of the data used in training; adjust with sox if needed. A full example recipe is in egs/tedlium2/align1/.

References

[1] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-End Speech Processing Toolkit," Proc. Interspeech'18, pp. 2207-2211 (2018)

[2] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," Proc. ICASSP'17, pp. 4835-4839 (2017)

[3] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey and Tomoki Hayashi, "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, Dec. 2017

Citations

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{hayashi2020espnet,
  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7654--7658},
  year={2020},
  organization={IEEE}
}
@inproceedings{inaguma-etal-2020-espnet,
    title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
    author = "Inaguma, Hirofumi  and
      Kiyono, Shun  and
      Duh, Kevin  and
      Karita, Shigeki  and
      Yalta, Nelson  and
      Hayashi, Tomoki  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.34",
    pages = "302--311",
}
