
Automatically creates or downloads alignments for multiple speech datasets, using pre-existing alignments where possible.


This tool wraps the Montreal Forced Aligner so that aligned speech data can be used as a PyTorch dataset.

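libritts_100 = LibrittsDataset(
  source_directory="path/to/source",  # placeholder path, replace with your own
  target_directory="path/to/target",  # placeholder path, replace with your own
)

  • source_directory specifies the directory where a) the data is already present or b) you want the data to be downloaded to
  • target_directory specifies the directory where you want the aligned data to be stored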

The dataset can then be used as follows:

for item in libritts_100:
  item["wav"] # the audio
  item["speaker"] # speaker key
  item["transcript"] # normalized transcript
  item["phones"] # a list of triples (start_time_in_seconds, end_time_in_seconds, phone)

The "phones" list also includes [SILENCE] tokens between words; a token has a length of 0 if no silence is present. Where punctuation occurs, the silence token is replaced with the corresponding punctuation token.
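
As a minimal sketch of working with the "phones" list, the snippet below uses a hand-written item (the times and phone labels are made up for illustration and are not output of the tool). The helper phone_durations is hypothetical and not part of this package; it drops zero-length [SILENCE] tokens and returns the duration of each remaining phone.

```python
# Hypothetical item; in practice it would come from iterating over a dataset.
item = {
    "phones": [
        (0.00, 0.12, "HH"),
        (0.12, 0.30, "AY"),
        (0.30, 0.30, "[SILENCE]"),  # zero length: no silence between the words
        (0.30, 0.45, "DH"),
        (0.45, 0.60, "EH"),
        (0.60, 0.75, "R"),
        (0.75, 0.95, "."),  # punctuation token in place of a silence token
    ]
}


def phone_durations(phones, drop_empty_silence=True):
    """Turn (start, end, phone) triples into (phone, duration_in_seconds) pairs."""
    result = []
    for start, end, phone in phones:
        # Skip silence tokens that carry no duration, if requested.
        if drop_empty_silence and phone == "[SILENCE]" and end == start:
            continue
        result.append((phone, round(end - start, 3)))
    return result


durations = phone_durations(item["phones"])
for phone, seconds in durations:
    print(f"{phone}\t{seconds:.3f}")
```

Passing drop_empty_silence=False keeps the zero-length tokens, which may be useful when word boundaries need to stay visible in the sequence.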

Supported Datasets

  • LibriTTS
  • LJSpeech
  • CommonVoice
  • GlobalPhone


Features

  • Automatically downloads data on first run.
  • Automatically downloads and installs Montreal Forced Aligner in its own conda environment.
  • Symlinks audio files rather than copying them for alignment.
  • Adds OOV words to Lexicon.
  • Add your own dataset by extending the AlignmentsDataset class and implementing a single method that collects the transcripts.
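
To illustrate the symlinking idea from the list above, here is a rough sketch using only the Python standard library. It is not this package's actual implementation; the function name link_audio_files and the "*.wav" pattern are assumptions made for the example.

```python
from pathlib import Path


def link_audio_files(source_directory, target_directory, pattern="*.wav"):
    """Symlink every file matching `pattern` under source into target.

    Hypothetical helper: files are linked rather than copied, so the
    target directory uses almost no extra disk space.
    """
    source = Path(source_directory).resolve()
    target = Path(target_directory).resolve()
    target.mkdir(parents=True, exist_ok=True)
    links = []
    for audio in sorted(source.rglob(pattern)):
        # Note: links are placed flat in target, so files with the same
        # name in different subdirectories would share one link.
        link = target / audio.name
        if not link.exists() and not link.is_symlink():
            link.symlink_to(audio)
        links.append(link)
    return links
```

Creating symlinks may require extra privileges on some platforms (for example, Windows), so a fallback to copying could be worth considering there.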

Planned Features

The following features are planned for future releases. Please feel free to open an issue if you have further ideas.

  • Visualise Alignments in a similar style to Praat
  • Integrate with phones to allow automatic conversion to IPA phones