
This tool extracts sentences and splits audio from mp3 and wav files so that they can be used as training data for speech-to-text and text-to-speech applications, e.g.



  • Run using python --dir_or_mp3_file "E:\my_audio_directory" --out_dir "output" or python --dir_or_mp3_file "E:\my_audio_file.mp3" --out_dir "output"


  • The tool creates an output folder, called out[timestamp] if you haven't provided a folder name using --out_dir. This folder contains:
    • a metadata.csv with filenames (excluding extensions) and texts
    • a wavs folder containing:
      • audio split into sentences and a corresponding txt file for each wav file with its spoken content
    • a tmp folder that keeps track of all audio files that have already been processed.
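To sanity-check an output folder against the layout above, a small helper can list wav files that lack a matching txt file. This is a minimal sketch and not part of the tool: the function name `find_unpaired_wavs` is hypothetical, and it assumes that each wav file and its transcript share the same base name inside the wavs folder.

```python
from pathlib import Path


def find_unpaired_wavs(out_dir):
    """Return base names of wav files in <out_dir>/wavs without a matching txt file.

    Assumption: a wav file and its transcript share the same base name.
    """
    wav_dir = Path(out_dir) / "wavs"
    wav_stems = {p.stem for p in wav_dir.glob("*.wav")}
    txt_stems = {p.stem for p in wav_dir.glob("*.txt")}
    # Anything present as wav but absent as txt is missing its transcript.
    return sorted(wav_stems - txt_stems)
```

Calling `find_unpaired_wavs("output")` returns an empty list when every wav file has its transcript.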


The tool can resume an interrupted run. To do so, it reads the last ID in metadata.csv and continues from there, skipping all wav files found in the output/tmp directory. As a result, some sentences can be lost if only part of a wav file had been transcribed when the run stopped.
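The resume behaviour can be illustrated roughly as follows. This is a sketch under stated assumptions, not the tool's actual code: the function names `read_last_id` and `files_to_process` are hypothetical, the metadata.csv delimiter (`|`) and a numeric ID at the end of the filename column are assumptions, and so is the idea that files in tmp carry the same base name as the inputs they stand for.

```python
import re
from pathlib import Path


def read_last_id(metadata_path, delimiter="|"):
    """Return the numeric ID of the last metadata row, or 0 if none is found.

    Assumption: the first column is a filename ending in digits, e.g. "audio_12".
    """
    path = Path(metadata_path)
    if not path.exists():
        return 0
    last_id = 0
    for line in path.read_text(encoding="utf-8").splitlines():
        name = line.split(delimiter, 1)[0].strip()
        match = re.search(r"(\d+)$", name)
        if match:
            last_id = int(match.group(1))
    return last_id


def files_to_process(input_files, tmp_dir):
    """Drop every input whose base name already appears in the tmp folder.

    Assumption: tmp entries share the base name of the processed input.
    """
    tmp = Path(tmp_dir)
    done = {p.stem for p in tmp.iterdir()} if tmp.is_dir() else set()
    return [f for f in input_files if Path(f).stem not in done]
```

Because skipping happens per file, a file that was only partly transcribed before the interruption is skipped as a whole, which is where the lost sentences come from.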


  • This tool has only been tested with German text.
  • All other languages require other models, or even other implementations of STT and grammatical restoration.