
An advanced Kaldi wrapper for Python

Primary language: C++. License: Apache License 2.0.

ExKaldi: A Python-based Extension Tool of Kaldi


The ExKaldi automatic speech recognition toolkit provides an interface between the Kaldi ASR toolkit and Python. Unlike other Kaldi wrappers, ExKaldi has the following features:

  1. Integrated APIs for building ASR systems, including feature extraction, GMM-HMM acoustic model training, N-gram language model training, decoding, and scoring.
  2. Tools that support training DNN acoustic models with deep learning frameworks such as TensorFlow.
  3. Support for CTC decoding.

The goal of ExKaldi is to help developers build high-performance ASR systems easily in Python.
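
To make the CTC decoding feature more concrete, here is a minimal sketch of greedy (best-path) CTC decoding in plain Python. It is not ExKaldi's implementation or API; the function name `ctc_greedy_decode` and the choice of `0` as the blank label are assumptions made for illustration only.

```python
from itertools import groupby


def ctc_greedy_decode(frame_label_ids, blank_id=0):
    """Greedy CTC decoding of a per-frame label sequence.

    Step 1: collapse runs of identical consecutive labels.
    Step 2: drop the blank label.
    `blank_id=0` is an assumption for this sketch, not an ExKaldi setting.
    """
    collapsed = [label for label, _ in groupby(frame_label_ids)]
    return [label for label in collapsed if label != blank_id]


# Hypothetical per-frame best labels (for example, the argmax of network outputs).
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0, 2]
decoded = ctc_greedy_decode(frames)
print(decoded)
```

The blank between the two runs of `3` is what keeps them as two separate output labels; without it, the repeats would collapse into one.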

Installation

Current version: 1.3.5. (We have only tested the toolkit on Ubuntu >= 16. with Python 3.6, 3.7, and 3.8, using GitHub Actions.)

  1. If you have not installed the Kaldi ASR toolkit, first clone its repository (Kaldi version 5.5 is expected).
git clone https://github.com/kaldi-asr/kaldi.git kaldi --origin upstream

Then follow the instructions in these three files to install and compile it.

less kaldi/INSTALL
less kaldi/tools/INSTALL
less kaldi/src/INSTALL
  2. Install ExKaldi, either with pip or from the source code in our GitHub project.

Install with pip

$ pip install https://github.com/kpu/kenlm/archive/master.zip
$ pip install exkaldi

Install from source

$ git clone https://github.com/wangyu09/exkaldi.git
$ cd exkaldi
$ bash quick_install.sh
  3. Check that it is installed correctly.
python3 -c "import exkaldi"
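
A slightly more informative check can be sketched with the Python standard library alone. The snippet below reports whether the `exkaldi` package is importable and whether a Kaldi executable is visible on `PATH`. The helper name `check_environment` is hypothetical, and using `compute-mfcc-feats` (a Kaldi feature-extraction binary) as the probe is an assumption; whether ExKaldi locates Kaldi through `PATH` in your setup should be confirmed in its documentation.

```python
import importlib.util
import shutil


def check_environment(module_name="exkaldi", kaldi_binary="compute-mfcc-feats"):
    """Report whether a Python package and a Kaldi binary can be found.

    Both defaults are assumptions for this sketch: the package name comes from
    the pip command above, and the binary is just one well-known Kaldi tool.
    """
    return {
        "python_package": importlib.util.find_spec(module_name) is not None,
        "kaldi_binary": shutil.which(kaldi_binary) is not None,
    }


for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Unlike a bare `import`, `find_spec` does not execute the package, so this check separates "not installed" from "installed but failing at import time".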

Tutorial

In the exkaldi/tutorials directory, we provide a simple tutorial that shows how to use the ExKaldi APIs to build an ASR system from scratch. The data comes from the LibriSpeech train_100_clean dataset. The tutorial covers:

  1. Extract and process MFCC features.
  2. Train and query an N-gram language model.
  3. Train a monophone GMM-HMM, build a decision tree, and train a triphone GMM-HMM.
  4. Train a DNN acoustic model with TensorFlow.
  5. Compile the WFST decoding graph.
  6. Decode with the GMM-HMM and DNN-HMM models.
  7. Process lattices and compute the WER.
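
As background for the final scoring step, the sketch below shows how a word error rate is conventionally computed: the Levenshtein distance between reference and hypothesis word sequences, summed over the corpus and divided by the number of reference words. It is a plain-Python illustration, not ExKaldi's scoring API; the function names and the example sentences are made up for this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of r
                curr[j - 1] + 1,         # insertion of h
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total edits / total reference words."""
    total_words = sum(len(r.split()) for r in references)
    if total_words == 0:
        raise ValueError("references contain no words")
    total_errors = sum(
        edit_distance(r.split(), h.split())
        for r, h in zip(references, hypotheses)
    )
    return 100.0 * total_errors / total_words


# Hypothetical transcripts: one deletion and one insertion over 8 reference words.
refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello there world"]
print(f"WER: {wer(refs, hyps):.2f}%")
```

The same routine yields a phone error rate when the token sequences are phones rather than words.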

The ASR system built in this tutorial is only a toy model; more formal experiments are in exkaldi/examples. See the source code or the documentation for more information about the APIs.

Experiments

We ran several experiments to test the ExKaldi toolkit, and the resulting systems performed well. The results are reported below.

TIMIT

1. Perplexity of various language models. All models are trained on the TIMIT training set and tested on the TIMIT test set. The scores in the table are perplexity (PPL) values.

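| System | LM toolkit | 2-grams | 3-grams | 4-grams | 5-grams | 6-grams |
|---|---|---|---|---|---|---|
| Kaldi baseline | irstlm | 14.41 | --- | --- | --- | --- |
| ExKaldi | srilm | 14.42 | 13.05 | 13.67 | 14.30 | 14.53 |
| ExKaldi | kenlm | 14.39 | 12.75 | 12.75 | 12.70 | 12.25 |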
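
For readers unfamiliar with the metric, the sketch below computes perplexity from per-token log10 probabilities, the base commonly used in ARPA-format N-gram models. The function name `perplexity` and the example scores are assumptions for illustration. Toolkits also differ on details such as whether the end-of-sentence token is counted, so this sketch will not necessarily reproduce the numbers in the table.

```python
import math


def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities.

    PPL = 10 ** ( -(1/N) * sum_i log10 p(w_i | history) )
    """
    if not log10_probs:
        raise ValueError("need at least one token score")
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))


# Hypothetical example: four tokens, each assigned probability 0.1 by the model.
scores = [math.log10(0.1)] * 4
print(f"PPL: {perplexity(scores):.2f}")
```

A model that assigns every token a probability of 0.1 is as uncertain as a uniform choice among ten words, which is why its perplexity is about 10.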

2. Phone error rate (PER) of various GMM-HMM-based systems. All systems are trained on the TIMIT training set and tested on the TIMIT test set. The language model backend used in ExKaldi is KenLM. The results indicate that KenLM can be used to improve the language model. Moreover, with ExKaldi we selected the N-gram model by comparing perplexities, which improved the performance of the ASR system.

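| System | Language model | mono | tri1 | tri2 | tri3 |
|---|---|---|---|---|---|
| Kaldi baseline | 2-grams | 32.54 | 26.17 | 23.63 | 21.54 |
| ExKaldi | 2-grams | 32.53 | 25.89 | 23.63 | 21.43 |
| ExKaldi | 6-grams | 29.83 | 24.07 | 22.40 | 20.01 |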
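
The selection step described above can be sketched in a few lines: pick the N-gram order with the lowest test perplexity, then measure how much the PER changes relative to the 2-gram system. The numbers are copied from the two tables in this section. The variable names and the `relative_reduction` helper are hypothetical, and this is only an illustration of the idea, not the script used in the experiments.

```python
# Test perplexities of the ExKaldi KenLM models (from the perplexity table).
kenlm_ppl = {2: 14.39, 3: 12.75, 4: 12.75, 5: 12.70, 6: 12.25}

# Choose the N-gram order with the lowest perplexity.
best_order = min(kenlm_ppl, key=kenlm_ppl.get)
print(f"lowest perplexity: {best_order}-grams ({kenlm_ppl[best_order]})")

# PER (%) of the ExKaldi GMM-HMM systems (from the PER table).
per_2gram = {"mono": 32.53, "tri1": 25.89, "tri2": 23.63, "tri3": 21.43}
per_6gram = {"mono": 29.83, "tri1": 24.07, "tri2": 22.40, "tri3": 20.01}


def relative_reduction(before, after):
    """Relative error reduction in percent."""
    return 100.0 * (before - after) / before


for stage in per_2gram:
    change = relative_reduction(per_2gram[stage], per_6gram[stage])
    print(f"{stage}: {per_2gram[stage]} -> {per_6gram[stage]} "
          f"({change:.1f}% relative reduction)")
```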

3. Phone error rate (PER) of two DNN-HMM-based systems. We trained our models with TensorFlow 2.3. The PyTorch-Kaldi toolkit used in our experiment is version 1.0.

| System | DNN | LSTM |
|---|---|---|
| Kaldi baseline | 18.67 | --- |
| PyTorch-Kaldi | 17.99 | 17.01 |
| ExKaldi | 15.13 | 15.01 |