/espnet

End-to-End Speech Processing Toolkit

Primary LanguagePythonApache License 2.0Apache-2.0

ESPnet: end-to-end speech processing toolkit

system/pytorch ver. 1.4.0 1.5.1 1.6.0 1.7.1 1.8.1 1.9.1 1.10.2 1.11.0 1.12.1
ubuntu20/python3.9/pip Github Actions
ubuntu20/python3.8/pip Github Actions
ubuntu18/python3.7/pip Github Actions Github Actions Github Actions Github Actions Github Actions Github Actions Github Actions Github Actions Github Actions
debian9/python3.7/conda debian9
centos7/python3.7/conda centos7
doc/python3.8 doc

PyPI version Python Versions Downloads GitHub license codecov Code style: black Imports: isort Mergify Status Gitter

Docs | Example | Example (ESPnet2) | Docker | Notebook

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on. ESPnet uses pytorch as a deep learning engine and also follows Kaldi style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.

Tutorial Series

Key Features

Kaldi style complete recipe

  • Support numbers of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)
  • Support numbers of TTS recipes with a similar manner to the ASR recipe (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Support numbers of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
  • Support numbers of MT recipes (IWSLT'14, IWSLT'16, the above ST recipes etc.)
  • Support numbers of SLU recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
  • Support numbers of SE/SS recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
  • Support voice conversion recipe (VCC2020 baseline)
  • Support speaker diarization recipe (mini_librispeech, librimix)
  • Support singing voice synthesis recipe (ofuton_p_utagoe_db)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention based end-to-end ASR
    • Fast/accurate training with CTC/attention multitask training
    • CTC/attention joint decoding to boost monotonic alignment decoding
    • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer or Branchformer
  • Attention: Dot product, location-aware attention, variants of multi-head
  • Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  • Batch GPU decoding
  • Data augmentation
  • Transducer based end-to-end ASR
    • Architecture:
      • RNN-based encoder and decoder.
      • Custom encoder and decoder supporting Transformer, Conformer (encoder), 1D Conv / TDNN (encoder) and causal 1D Conv (decoder) blocks.
      • VGG2L (RNN/custom encoder) and Conv2D (custom encoder) bottlenecks.
    • Search algorithms:
    • Features:
      • Multi-task learning with various auxiliary losses:
        • Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
        • Decoder: cross-entropy w/ label smoothing.
      • Transfer learning with acoustic model and/or language model.
      • Training with FastEmit regularization method [Yu et al., 2021].

    Please refer to the tutorial page for complete documentation.

  • CTC segmentation
  • Non-autoregressive model based on Mask-CTC
  • ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
  • Wav2Vec2.0 pretrained model as Encoder, imported from FairSeq.
  • Self-supervised learning representations as features, using upstream models in S3PRL in frontend.
    • Set frontend to be s3prl
    • Select any upstream model by setting the frontend_conf to the corresponding name.
  • Transfer Learning :
  • Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
  • Restricted Self-Attention based on Longformer as an encoder for long sequences

Demonstration

TTS: Text-to-speech

  • Architecture
    • Tacotron2
    • Transformer-TTS
    • FastSpeech
    • FastSpeech2
    • Conformer FastSpeech & FastSpeech2
    • VITS
    • JETS
  • Multi-speaker & multi-language extention
    • Pretrained speaker embedding (e.g., X-vector)
    • Speaker ID embedding
    • Language ID embedding
    • Global style token (GST) embedding
    • Mix of the above embeddings
  • End-to-end training
    • End-to-end text-to-wav model (e.g., VITS, JETS, etc.)
    • Joint training of text2mel and vocoder
  • Various language support
    • En / Jp / Zn / De / Ru / And more...
  • Integration with neural vocoders
    • Parallel WaveGAN
    • MelGAN
    • Multi-band MelGAN
    • HiFiGAN
    • StyleMelGAN
    • Mix of the above models

Demonstration

To train the neural vocoder, please check the following repositories:

NOTE:

  • We are moving on ESPnet2-based development for TTS.
  • The use of ESPnet1-TTS is deprecated, please use ESPnet2-TTS.

SE: Speech enhancement (and separation)

  • Single-speaker speech enhancement
  • Multi-speaker speech separation
  • Unified encoder-separator-decoder structure for time-domain and frequency-domain models
  • Flexible ASR integration: working as an individual task or as the ASR frontend
  • Easy to import pretrained models from Asteroid
    • Both the pre-trained models from Asteroid and the specific configuration are supported.

Demonstration

  • Interactive SE demo with ESPnet2 Open In Colab

ST: Speech Translation & MT: Machine Translation

  • State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
  • Transformer based end-to-end ST (new!)
  • Transformer based end-to-end MT (new!)

VC: Voice conversion

  • Transformer and Tacotron2 based parallel VC using melspectrogram (new!)
  • End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)

SLU: Spoken Language Understanding

  • Architecture
    • Transformer based Encoder
    • Conformer based Encoder
    • Branchformer based Encoder
    • RNN based Decoder
    • Transformer based Decoder
  • Support Multitasking with ASR
    • Predict both intent and ASR transcript
  • Support Multitasking with NLU
    • Deliberation encoder based 2 pass model
  • Support using pretrained ASR models
    • Hubert
    • Wav2vec2
    • VQ-APC
    • TERA and more ...
  • Support using pretrained NLP models
    • BERT
    • MPNet And more...
  • Various language support
    • En / Jp / Zn / Nl / And more...
  • Supports using context from previous utterances
  • Supports using other tasks like SE in pipeline manner
  • Supports Two Pass SLU that combines audio and ASR transcript Demonstration
  • Performing noisy spoken language understanding using speech enhancement model followed by spoken language understanding model. Open In Colab
  • Performing two pass spoken language understanding where the second pass model attends on both acoustic and semantic information. Open In Colab
  • Integrated to Huggingface Spaces with Gradio. See SLU demo on multiple languages: Hugging Face Spaces

SUM: Speech Summarization

  • End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [Sharma et al., 2022]

SVS: Singing Voice Synthesis

  • Framework merge from Muskits
  • Architecture
    • RNN-based non-autoregressive model
    • Xiaoice
    • Sequence-to-sequence Transformer (with GLU-based encoder)
    • MLP singer
    • Tacotron-singing (in progress)
    • DiffSinger (to be published)
    • VISinger (in progress)
  • Support multi-speaker & multilingual singing synthesis
    • Speaker ID embedding
    • Language ID embedding
    • Global sytle token (GST) embedding
  • Various language support
    • Jp / En / Kr / Zh
  • Tight integration with neural vocoders (the same as TTS)

DNN Framework

  • Flexible network architecture thanks to chainer and pytorch
  • Flexible front-end processing thanks to kaldiio and HDF5 support
  • Tensorboard based monitoring

ESPnet2

See ESPnet2.

  • Independent from Kaldi/Chainer, unlike ESPnet1
  • On the fly feature extraction and text processing when training
  • Supporting DistributedDataParallel and DaraParallel both
  • Supporting multiple nodes training and integrated with Slurm or MPI
  • Supporting Sharded Training provided by fairscale
  • A template recipe which can be applied for all corpora
  • Possible to train any size of corpus without CPU memory error
  • ESPnet Model Zoo
  • Integrated with wandb

Installation

  • If you intend to do full experiments including DNN training, then see Installation.

  • If you just need the Python module only:

    # We recommend you installing pytorch before installing espnet following https://pytorch.org/get-started/locally/
    pip install espnet
    # To install latest
    # pip install git+https://github.com/espnet/espnet
    # To install additional packages
    # pip install "espnet[all]"

    If you'll use ESPnet1, please install chainer and cupy.

    pip install chainer==6.0.0 cupy==6.0.0    # [Option]

    You might need to install some packages depending on each task. We prepared various installation scripts at tools/installers.

  • (ESPnet2) Once installed, run wandb login and set --use_wandb true to enable tracking runs using W&B.

Usage

See Usage.

Docker Container

go to docker/ and follow instructions.

Contribution

Thank you for taking times for ESPnet! Any contributions to ESPnet are welcome and feel free to ask any questions or requests to issues. If it's the first contribution to ESPnet for you, please follow the contribution guide.

Results and demo

You can find useful tutorials and demos in Interspeech 2019 Tutorial

ASR results

expand

We list the character error rate (CER) and word error rate (WER) of major ASR tasks.

Task CER (%) WER (%) Pretrained model
Aishell dev/test 4.6/5.1 N/A link
ESPnet2 Aishell dev/test 4.4/4.7 N/A link
Common Voice dev/test 1.7/1.8 2.2/2.3 link
CSJ eval1/eval2/eval3 5.7/3.8/4.2 N/A link
ESPnet2 CSJ eval1/eval2/eval3 4.5/3.3/3.6 N/A link
HKUST dev 23.5 N/A link
ESPnet2 HKUST dev 21.2 N/A link
Librispeech dev_clean/dev_other/test_clean/test_other N/A 1.9/4.9/2.1/4.9 link
ESPnet2 Librispeech dev_clean/dev_other/test_clean/test_other 0.6/1.5/0.6/1.4 1.7/3.4/1.8/3.6 link
Switchboard (eval2000) callhm/swbd N/A 14.0/6.8 link
TEDLIUM2 dev/test N/A 8.6/7.2 link
TEDLIUM3 dev/test N/A 9.6/7.6 link
WSJ dev93/eval92 3.2/2.1 7.0/4.7 N/A
ESPnet2 WSJ dev93/eval92 1.1/0.8 2.8/1.8 link

Note that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using the wide network (#units = 1024) and large subword units if necessary reported by RWTH.

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/asr1/RESULTS.md.

ASR demo

expand

You can recognize speech in a WAV file using pretrained models. Go to a recipe directory and run utils/recog_wav.sh as follows:

# go to recipe directory and source path of espnet tools
cd egs/tedlium2/asr1 && . ./path.sh
# let's recognize speech!
recog_wav.sh --models tedlium2.transformer.v1 example.wav

where example.wav is a WAV file to be recognized. The sampling rate must be consistent with that of data used in training.

Available pretrained models in the demo script are listed as below.

Model Notes
tedlium2.rnn.v1 Streaming decoding based on CTC-based VAD
tedlium2.rnn.v2 Streaming decoding based on CTC-based VAD (batch decoding)
tedlium2.transformer.v1 Joint-CTC attention Transformer trained on Tedlium 2
tedlium3.transformer.v1 Joint-CTC attention Transformer trained on Tedlium 3
librispeech.transformer.v1 Joint-CTC attention Transformer trained on Librispeech
commonvoice.transformer.v1 Joint-CTC attention Transformer trained on CommonVoice
csj.transformer.v1 Joint-CTC attention Transformer trained on CSJ
csj.rnn.v1 Joint-CTC attention VGGBLSTM trained on CSJ

SE results

expand

We list results from three different models on WSJ0-2mix, which is one the most widely used benchmark dataset for speech separation.

Model STOI SAR SDR SIR
TF Masking 0.89 11.40 10.24 18.04
Conv-Tasnet 0.95 16.62 15.94 25.90
DPRNN-Tasnet 0.96 18.82 18.29 28.92

SE demos

expand
You can try the interactive demo with Google Colab. Please click the following button to get access to the demos.

Open In Colab

It is based on ESPnet2. Pretrained models are available for both speech enhancement and speech separation tasks.

ST results

expand

We list 4-gram BLEU of major ST tasks.

end-to-end system

Task BLEU Pretrained model
Fisher-CallHome Spanish fisher_test (Es->En) 51.03 link
Fisher-CallHome Spanish callhome_evltest (Es->En) 20.44 link
Libri-trans test (En->Fr) 16.70 link
How2 dev5 (En->Pt) 45.68 link
Must-C tst-COMMON (En->De) 22.91 link
Mboshi-French dev (Fr->Mboshi) 6.18 N/A

cascaded system

Task BLEU Pretrained model
Fisher-CallHome Spanish fisher_test (Es->En) 42.16 N/A
Fisher-CallHome Spanish callhome_evltest (Es->En) 19.82 N/A
Libri-trans test (En->Fr) 16.96 N/A
How2 dev5 (En->Pt) 44.90 N/A
Must-C tst-COMMON (En->De) 23.65 N/A

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/st1/RESULTS.md.

ST demo

expand

(New!) We made a new real-time E2E-ST + TTS demonstration in Google Colab. Please access the notebook from the following button and enjoy the real-time speech-to-speech translation!

Open In Colab


You can translate speech in a WAV file using pretrained models. Go to a recipe directory and run utils/translate_wav.sh as follows:

# go to recipe directory and source path of espnet tools
cd egs/fisher_callhome_spanish/st1 && . ./path.sh
# download example wav file
wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
# let's translate speech!
translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav

where test.wav is a WAV file to be translated. The sampling rate must be consistent with that of data used in training.

Available pretrained models in the demo script are listed as below.

Model Notes
fisher_callhome_spanish.transformer.v1 Transformer-ST trained on Fisher-CallHome Spanish Es->En

MT results

expand
Task BLEU Pretrained model
Fisher-CallHome Spanish fisher_test (Es->En) 61.45 link
Fisher-CallHome Spanish callhome_evltest (Es->En) 29.86 link
Libri-trans test (En->Fr) 18.09 link
How2 dev5 (En->Pt) 58.61 link
Must-C tst-COMMON (En->De) 27.63 link
IWSLT'14 test2014 (En->De) 24.70 link
IWSLT'14 test2014 (De->En) 29.22 link
IWSLT'14 test2014 (De->En) 32.2 link
IWSLT'16 test2014 (En->De) 24.05 link
IWSLT'16 test2014 (De->En) 29.13 link

TTS results

ESPnet2

You can listen to the generated samples in the following URL.

Note that in the generation we use Griffin-Lim (wav/) and Parallel WaveGAN (wav_pwg/).

You can download pretrained models via espnet_model_zoo.

You can download pretrained vocoders via kan-bayashi/ParallelWaveGAN.

ESPnet1

NOTE: We are moving on ESPnet2-based development for TTS. Please check the latest results in the above ESPnet2 results.

You can listen to our samples in demo HP espnet-tts-sample. Here we list some notable ones:

You can download all of the pretrained models and generated samples:

Note that in the generated samples we use the following vocoders: Griffin-Lim (GL), WaveNet vocoder (WaveNet), Parallel WaveGAN (ParallelWaveGAN), and MelGAN (MelGAN). The neural vocoders are based on following repositories.

If you want to build your own neural vocoder, please check the above repositories. kan-bayashi/ParallelWaveGAN provides the manual about how to decode ESPnet-TTS model's features with neural vocoders. Please check it.

Here we list all of the pretrained neural vocoders. Please download and enjoy the generation of high quality speech!

Model link Lang Fs [Hz] Mel range [Hz] FFT / Shift / Win [pt] Model type
ljspeech.wavenet.softmax.ns.v1 EN 22.05k None 1024 / 256 / None Softmax WaveNet
ljspeech.wavenet.mol.v1 EN 22.05k None 1024 / 256 / None MoL WaveNet
ljspeech.parallel_wavegan.v1 EN 22.05k None 1024 / 256 / None Parallel WaveGAN
ljspeech.wavenet.mol.v2 EN 22.05k 80-7600 1024 / 256 / None MoL WaveNet
ljspeech.parallel_wavegan.v2 EN 22.05k 80-7600 1024 / 256 / None Parallel WaveGAN
ljspeech.melgan.v1 EN 22.05k 80-7600 1024 / 256 / None MelGAN
ljspeech.melgan.v3 EN 22.05k 80-7600 1024 / 256 / None MelGAN
libritts.wavenet.mol.v1 EN 24k None 1024 / 256 / None MoL WaveNet
jsut.wavenet.mol.v1 JP 24k 80-7600 2048 / 300 / 1200 MoL WaveNet
jsut.parallel_wavegan.v1 JP 24k 80-7600 2048 / 300 / 1200 Parallel WaveGAN
csmsc.wavenet.mol.v1 ZH 24k 80-7600 2048 / 300 / 1200 MoL WaveNet
csmsc.parallel_wavegan.v1 ZH 24k 80-7600 2048 / 300 / 1200 Parallel WaveGAN

If you want to use the above pretrained vocoders, please exactly match the feature setting with them.

TTS demo

ESPnet2

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis!

  • Real-time TTS demo with ESPnet2 Open In Colab

English, Japanese, and Mandarin models are available in the demo.

ESPnet1

NOTE: We are moving on ESPnet2-based development for TTS. Please check the latest demo in the above ESPnet2 demo.

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis.

  • Real-time TTS demo with ESPnet1 Open In Colab

We also provide shell script to perform synthesize. Go to a recipe directory and run utils/synth_wav.sh as follows:

# go to recipe directory and source path of espnet tools
cd egs/ljspeech/tts1 && . ./path.sh
# we use upper-case char sequence for the default model.
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
# let's synthesize speech!
synth_wav.sh example.txt

# also you can use multiple sentences
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example_multi.txt
echo "TEXT TO SPEECH IS A TECHNIQUE TO CONVERT TEXT INTO SPEECH." >> example_multi.txt
synth_wav.sh example_multi.txt

You can change the pretrained model as follows:

synth_wav.sh --models ljspeech.fastspeech.v1 example.txt

Waveform synthesis is performed with Griffin-Lim algorithm and neural vocoders (WaveNet and ParallelWaveGAN). You can change the pretrained vocoder model as follows:

synth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt

WaveNet vocoder provides very high quality speech but it takes time to generate.

See more details or available models via --help.

synth_wav.sh --help

VC results

expand
  • Transformer and Tacotron2 based VC

You can listen to some samples on the demo webpage.

  • Cascade ASR+TTS as one of the baseline systems of VCC2020

The Voice Conversion Challenge 2020 (VCC2020) adopts ESPnet to build an end-to-end based baseline system. In VCC2020, the objective is intra/cross lingual nonparallel VC. You can download converted samples of the cascade ASR+TTS baseline system here.

SLU results

expand

We list the performance on various SLU tasks and dataset using the metric reported in the original dataset paper

Task Dataset Metric Result Pretrained Model
Intent Classification SLURP Acc 86.3 link
Intent Classification FSC Acc 99.6 link
Intent Classification FSC Unseen Speaker Set Acc 98.6 link
Intent Classification FSC Unseen Utterance Set Acc 86.4 link
Intent Classification FSC Challenge Speaker Set Acc 97.5 link
Intent Classification FSC Challenge Utterance Set Acc 78.5 link
Intent Classification SNIPS F1 91.7 link
Intent Classification Grabo (Nl) Acc 97.2 link
Intent Classification CAT SLU MAP (Zn) Acc 78.9 link
Intent Classification Google Speech Commands Acc 98.4 link
Slot Filling SLURP SLU-F1 71.9 link
Dialogue Act Classification Switchboard Acc 67.5 link
Dialogue Act Classification Jdcinal (Jp) Acc 67.4 link
Emotion Recognition IEMOCAP Acc 69.4 link
Emotion Recognition swbd_sentiment Macro F1 61.4 link
Emotion Recognition slue_voxceleb Macro F1 44.0 link

If you want to check the results of the other recipes, please check egs2/<name_of_recipe>/asr1/RESULTS.md.

CTC Segmentation demo

ESPnet1

CTC segmentation determines utterance segments within audio files. Aligned utterance segments constitute the labels of speech datasets.

As demo, we align start and end of utterances within the audio file ctc_align_test.wav, using the example script utils/asr_align_wav.sh. For preparation, set up a data directory:

cd egs/tedlium2/align1/
# data directory
align_dir=data/demo
mkdir -p ${align_dir}
# wav file
base=ctc_align_test
wav=../../../test_utils/${base}.wav
# recipe files
echo "batchsize: 0" > ${align_dir}/align.yaml

cat << EOF > ${align_dir}/utt_text
${base} THE SALE OF THE HOTELS
${base} IS PART OF HOLIDAY'S STRATEGY
${base} TO SELL OFF ASSETS
${base} AND CONCENTRATE
${base} ON PROPERTY MANAGEMENT
EOF

Here, utt_text is the file containing the list of utterances. Choose a pre-trained ASR model that includes a CTC layer to find utterance segments:

# pre-trained ASR model
model=wsj.transformer_small.v1
mkdir ./conf && cp ../../wsj/asr1/conf/no_preprocess.yaml ./conf

../../../utils/asr_align_wav.sh \
    --models ${model} \
    --align_dir ${align_dir} \
    --align_config ${align_dir}/align.yaml \
    ${wav} ${align_dir}/utt_text

Segments are written to aligned_segments as a list of file/utterance name, utterance start and end times in seconds and a confidence score. The confidence score is a probability in log space that indicates how good the utterance was aligned. If needed, remove bad utterances:

min_confidence_score=-5
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${align_dir}/aligned_segments

The demo script utils/ctc_align_wav.sh uses an already pretrained ASR model (see list above for more models). It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files; rather than using Transformer models that have a high memory consumption on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with sox if needed. A full example recipe is in egs/tedlium2/align1/.

ESPnet2

CTC segmentation determines utterance segments within audio files. Aligned utterance segments constitute the labels of speech datasets.

As demo, we align start and end of utterances within the audio file ctc_align_test.wav. This can be done either directly from the Python command line or using the script espnet2/bin/asr_align.py.

From the Python command line interface:

# load a model with character tokens
from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader(cachedir="./modelcache")
wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")
# load the example file included in the ESPnet repository
import soundfile
speech, rate = soundfile.read("./test_utils/ctc_align_test.wav")
# CTC segmentation
from espnet2.bin.asr_align import CTCSegmentation
aligner = CTCSegmentation( **wsjmodel , fs=rate )
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""
segments = aligner(speech, text)
print(segments)
# utt1 utt 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt2 utt 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt3 utt 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt4 utt 4.20 6.10 -0.4899 AND CONCENTRATE ON PROPERTY MANAGEMENT

Aligning also works with fragments of the text. For this, set the gratis_blank option that allows skipping unrelated audio sections without penalty. It's also possible to omit the utterance names at the beginning of each line, by setting kaldi_style_text to False.

aligner.set_config( gratis_blank=True, kaldi_style_text=False )
text = ["SALE OF THE HOTELS", "PROPERTY MANAGEMENT"]
segments = aligner(speech, text)
print(segments)
# utt_0000 utt 0.37 1.72 -2.0651 SALE OF THE HOTELS
# utt_0001 utt 4.70 6.10 -5.0566 PROPERTY MANAGEMENT

The script espnet2/bin/asr_align.py uses a similar interface. To align utterances:

# ASR model and config files from pretrained model (e.g. from cachedir):
asr_config=<path-to-model>/config.yaml
asr_model=<path-to-model>/valid.*best.pth
# prepare the text file
wav="test_utils/ctc_align_test.wav"
text="test_utils/ctc_align_text.txt"
cat << EOF > ${text}
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE
utt5 ON PROPERTY MANAGEMENT
EOF
# obtain alignments:
python espnet2/bin/asr_align.py --asr_train_config ${asr_config} --asr_model_file ${asr_model} --audio ${wav} --text ${text}
# utt1 ctc_align_test 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt2 ctc_align_test 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt3 ctc_align_test 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt4 ctc_align_test 4.20 4.97 -0.6017 AND CONCENTRATE
# utt5 ctc_align_test 4.97 6.10 -0.3477 ON PROPERTY MANAGEMENT

The output of the script can be redirected to a segments file by adding the argument --output segments. Each line contains file/utterance name, utterance start and end times in seconds and a confidence score; optionally also the utterance text. The confidence score is a probability in log space that indicates how good the utterance was aligned. If needed, remove bad utterances:

min_confidence_score=-7
# here, we assume that the output was written to the file `segments`
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' segments

See the module documentation for more information. It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files; rather than using Transformer models that have a high memory consumption on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with sox if needed.

Also, we can use this tool to provide token-level segmentation information if we prepare a list of tokens instead of that of utterances in the text file. See the discussion in espnet#4278 (comment).

Citations

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{hayashi2020espnet,
  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7654--7658},
  year={2020},
  organization={IEEE}
}
@inproceedings{inaguma-etal-2020-espnet,
    title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
    author = "Inaguma, Hirofumi  and
      Kiyono, Shun  and
      Duh, Kevin  and
      Karita, Shigeki  and
      Yalta, Nelson  and
      Hayashi, Tomoki  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.34",
    pages = "302--311",
}
@inproceedings{li2020espnet,
  title={{ESPnet-SE}: End-to-End Speech Enhancement and Separation Toolkit Designed for {ASR} Integration},
  author={Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and Naoyuki Kamo and Moto Hira and Tomoki Hayashi and Christoph Boeddeker and Zhuo Chen and Shinji Watanabe},
  booktitle={Proceedings of IEEE Spoken Language Technology Workshop (SLT)},
  pages={785--792},
  year={2021},
  organization={IEEE},
}
@article{arora2021espnet,
  title={ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet},
  author={Arora, Siddhant and Dalmia, Siddharth and Denisov, Pavel and Chang, Xuankai and Ueda, Yushi and Peng, Yifan and Zhang, Yuekai and Kumar, Sujay and Ganesan, Karthik and Yan, Brian and others},
  journal={arXiv preprint arXiv:2111.14706},
  year={2021}
}