MIT License

awesome-kaldi

This is a list of features, scripts, blogs and resources for better using Kaldi (http://kaldi-asr.org/). Please feel free to contribute by adding more links!

Good resources for beginners:

  1. How to start with kaldi and speech recognition - A Medium post (by me) about the general structure of the Kaldi project and its different parts. In my opinion, you should start here.
  2. Kaldi for Dummies tutorial - The basic tutorial in the Kaldi documentation. It is really good for hands-on experience, but it does not explain things very well.
  3. How to Train a Deep Neural Net Acoustic Model with Kaldi - A tutorial by Josh Meyer specifically about running Kaldi with a DNN.
  4. Building Speech Recognition Systems with the Kaldi Toolkit - This presentation is extremely long but also extremely helpful. It's the most complete source of information about the training process and its development.
  5. Eleanor Chodroff Kaldi Tutorial - A good in-depth tutorial about the training process with a lot of code examples.
  6. Speaker Diarization with Kaldi - A tutorial about X-Vectors and Speaker Diarization.
  7. Understanding a typical Kaldi Recipe - A good article that explains what each stage of a mini_librispeech recipe does.

Good resources for more complex stuff:

  1. Some Kaldi Notes - Advanced notes that are highly recommended reading if you want to become a more proficient user.
  2. Decoding graph construction in Kaldi: A visual walkthrough - If you want to understand the different parts of the decoding graph, you should probably read this. You need to understand these concepts to debug your graph while developing a new model.

Good Utils

Deep in the utils folder inside the wsj recipe there are some interesting scripts that helped me a lot during my work. Knowing all of them will probably help you a lot. Here are some basic ones that you should probably start with:

  1. perturb_data_dir_speed_3way.sh - This script changes the speaking speed of utterances without creating extra audio files. It does this by adding a SoX command to your wav entries and copying and editing all the other files in your data folder. Using this script together with the next one is standard practice in most state-of-the-art systems and helps your model generalize better.
  2. perturb_data_dir_volume.sh - This script does the same thing but changes the volume of the utterances.
  3. resample_data_dir.sh - Do you want to train a new model for a different sampling rate without manually resampling your entire dataset? This script does it for you, again with a SoX command.
  4. combine_data.sh - If you have multiple datasets and want to combine them, there is no need to do it manually, file by file. This script takes entire data directories and combines their files into a single new directory.
  5. summarize_logs.pl & summarize_warnings.pl - When you run a process in Kaldi with multiple jobs, each job has its own log file. With a lot of jobs it can be hard to look through all of those logs. These scripts summarize all of the logs into one readable summary.
  6. Finetune acoustic model - If you don't have a lot of data, you can always train a Kaldi model on the domain closest to yours, then take the final.mdl file and fine-tune it with your data.
  7. Kaldi-ONNX project by XiaoMi - A project that helps convert a Kaldi model to ONNX so you can easily use the model in different frameworks.
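
To get a feel for what the speed perturbation script from item 1 does, here is a minimal Python sketch of how a single wav.scp entry could be rewritten so that SoX changes the speed on the fly. It is not the script's actual code: the function name `perturb_wav_scp_line`, the example paths, the `sp<factor>-` ID prefix and the factors 0.9 and 1.1 are assumptions for illustration, and the real script also updates the other files in the data directory.

```python
def perturb_wav_scp_line(line: str, factor: float) -> str:
    """Return a wav.scp entry whose audio is speed-perturbed by `factor` via SoX.

    Illustrative sketch only; the ID prefix and command layout are assumptions.
    """
    rec_id, source = line.strip().split(maxsplit=1)
    if source.endswith("|"):
        # The entry is already a command pipeline: append SoX to the pipe.
        command = f"{source} sox -t wav - -t wav - speed {factor} |"
    else:
        # The entry is a plain file path: let SoX read the file directly.
        command = f"sox -t wav {source} -t wav - speed {factor} |"
    # A new ID keeps the perturbed copy distinct from the original recording.
    return f"sp{factor}-{rec_id} {command}"


original = "utt1 /data/audio/utt1.wav"  # hypothetical wav.scp line
for factor in (0.9, 1.1):
    print(perturb_wav_scp_line(original, factor))
```

Because the audio is produced by a command at read time, no perturbed wav files are written to disk, which is why this kind of augmentation does not create excess files.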

Good Kaldi "production ready" examples 

There are some open-source projects that use Kaldi as a platform for building ASR systems for real-time usage. By studying those projects you can learn a lot about how to implement such a system of your own.

  1. online2-tcp-nnet3-decode-faster - A new executable that was added to Kaldi. It creates a basic TCP server that can read the model and transcribe raw audio. If you want an easy-to-implement solution for quickly checking your models, you should probably start here.
  2. kaldi-gstreamer-server - This is a nice project that integrates the Kaldi toolkit with GStreamer, a popular framework that will help you build a scalable ASR server.
  3. kaldi-offline-transcriber - A good example of a project that handles both training and decoding. It was built for Estonian but can easily be adapted to any language.
  4. compile Kaldi for android - You can also compile the Kaldi project so that it runs directly on Android devices. That might not be a good idea with a heavy model, but it can work for more constrained models.
  5. VBDiarization - A good implementation of Speaker Diarization; it can be used with the Kaldi pre-trained X-Vector model.
  6. tf-kaldi-speaker - A framework that combines TensorFlow and Kaldi in the context of speaker verification/identification tasks. The project has some pretrained models that were trained on huge datasets.
  7. kaldi-adapt-lm - A tool that helps to adapt nnet3 chain models to a different language model.
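
If you want to try the TCP server from item 1, a client can be as simple as the following Python sketch. It assumes the server accepts raw 16-bit mono PCM samples at the model's sampling rate and sends the transcript back as text on the same connection. The function name `transcribe_over_tcp`, the host and port 5050 are assumptions, so check them against how you started the server.

```python
import socket


def transcribe_over_tcp(pcm_bytes: bytes, host: str = "localhost",
                        port: int = 5050, chunk_size: int = 4096) -> str:
    """Stream raw PCM audio to a TCP decoding server and return its text reply.

    Sketch only: host, port and the audio format are assumptions.
    """
    received = []
    with socket.create_connection((host, port)) as conn:
        # Send the audio in small chunks, as a live microphone stream would.
        for start in range(0, len(pcm_bytes), chunk_size):
            conn.sendall(pcm_bytes[start:start + chunk_size])
        # Half-close the connection so the server knows the audio has ended.
        conn.shutdown(socket.SHUT_WR)
        # Collect everything the server writes until it closes the connection.
        while True:
            data = conn.recv(4096)
            if not data:
                break
            received.append(data)
    return b"".join(received).decode("utf-8", errors="replace")
```

For example, `transcribe_over_tcp(open("audio.raw", "rb").read())` would send a hypothetical headerless audio file. Sending everything before reading keeps the sketch short; a real-time client would read partial results while it is still streaming.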

Available pretrained models:

  1. Kaldi pretrained models - The models available on the Kaldi website, in English, Arabic and Mandarin. They also recently added the LibriSpeech SOTA model.
  2. Open source speech recognition recipe and corpus for building German acoustic models with Kaldi - Trained on roughly 700 hours of German data. Gets updated from time to time.

Resources for understanding the math/science behind Kaldi better:

  1. Speech Recognition with Weighted Finite-State Transducers - The "bible" for understanding WFST-based systems for speech recognition. This should probably be your first read.
  2. Semirings and WFST - A good small course (~3 hours) from Nanyang Technological University that covers the idea of WFSTs in a really straightforward and visual way.
  3. The HTK book - The HTK book is for another ASR toolkit, but it highlights the basics of speech recognition in a really intuitive and graphic way.
  4. A Bit of Progress in Language Modeling, J. Goodman, 2001 - The most basic and comprehensive article about the creation of language models.
  5. GMM Acoustic Modeling and Feature Extraction - A really good presentation by Andrew Maas for better understanding the GMM-based phoneme alignment.

Important articles

  1. The Kaldi Speech Recognition Toolkit, D. Povey et al., 2011 - The original article that described Kaldi and the different parts of the project. Note that some parts of that article are outdated.
  2. A time delay neural network architecture for efficient modeling of long temporal contexts, V. Peddinti, D. Povey, S. Khudanpur, 2015 - The article that describes the usage of TDNNs in Kaldi.
  3. Hybrid speech recognition with Deep Bidirectional LSTM, A. Graves, N. Jaitly and A. Mohamed, 2013 - An article about the basic BLSTM recipe in Kaldi.