Automatic pipeline to prepare a directory full of (audio clip : transcript) file pairs for wav2letter training. Currently uses DSAlign for transcript alignment.
This project is part of Talon Research. If you find this useful, please donate.
This process works best on a Mac or Linux computer.
Setup (Debian/Ubuntu):

```shell
sudo apt install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev \
    python3 python3-pip ffmpeg wget
./setup
```

Setup (Mac):

```shell
brew install python3 ffmpeg wget cmake boost
./setup
```
Usage:

```shell
./wav2train input/ output/
# ./wfilter output/clips.lst > output/clips-filt.lst # not yet implemented
./wsplit output/clips.lst
```
1. Consumes a directory with audio and matching transcripts, such as:

   ```
   input/a.wav
   input/a.txt
   input/b.wav
   input/b.txt
   ```

   Most common audio formats (wav, flac, mp3, ogg, sph, etc.) will be detected, and you can mix formats in the input directory. The audio files can be any length. The only requirement is that each text file is a transcription of its audio file.
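Before running the pipeline, it can help to confirm that every audio file actually has a transcript next to it. A minimal sketch of such a check — this helper is hypothetical and not part of wav2train:

```shell
# Hypothetical helper, not part of wav2train: list audio files in a
# directory that have no matching .txt transcript beside them.
check_pairs() {
  dir="$1"
  for audio in "$dir"/*.wav "$dir"/*.flac "$dir"/*.mp3 "$dir"/*.ogg; do
    [ -e "$audio" ] || continue      # skip glob patterns that matched nothing
    base="${audio%.*}"               # strip the audio extension
    [ -f "$base.txt" ] || echo "missing transcript: $audio"
  done
}
```

Running `check_pairs input/` before `./wav2train` surfaces unpaired files up front instead of partway through alignment.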
2. Finds voice activity in the audio files and time-aligns the detected segments to the transcription.
3. Extracts the voice segments into `.flac` files and creates a wav2letter-compatible `clips.lst` file.
4. The output at this point looks like:

   ```
   output/clips/a.flac
   output/clips/b.flac
   output/clips.lst
   ```
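For reference, each line of `clips.lst` follows wav2letter's dataset list format — to the best of my knowledge, a sample id, the audio path, the duration in milliseconds, and the transcript. The values below are purely illustrative, not real pipeline output:

```
a-0000 output/clips/a-0000.flac 2410.5 hello world
a-0001 output/clips/a-0001.flac 1873.2 this is a sample transcript
```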
5. [Optional, not yet implemented] Use the `wfilter` tool to filter out "bad inputs" using a pretrained model and an error threshold:

   ```shell
   ./wfilter output/clips.lst > output/clips-filt.lst
   ```
6. [Optional] Use the `wsplit` tool to auto-split a `clips.lst` file into `dev.lst`, `test.lst`, and `train.lst`:

   ```shell
   ./wsplit output/clips.lst # or, if you filtered: ./wsplit output/clips-filt.lst
   ```
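If you would rather split by hand, the effect of `wsplit` can be approximated with standard tools. Below is a sketch of an 80/10/10 shuffle-and-split; the ratios and logic are assumptions for illustration, not `wsplit`'s actual implementation (`shuf` is GNU coreutils — on a Mac, `brew install coreutils` provides it as `gshuf`):

```shell
# Illustrative only: shuffle a clips.lst and split it roughly 80/10/10
# into train/dev/test. The real wsplit may use different ratios or logic.
split_lst() {
  lst="$1"; outdir="$2"
  total=$(wc -l < "$lst")
  ntrain=$((total * 8 / 10))          # 80% for train
  ndev=$(( (total - ntrain) / 2 ))    # half the remainder for dev
  shuf "$lst" > "$outdir/shuffled.lst"
  head -n "$ntrain" "$outdir/shuffled.lst" > "$outdir/train.lst"
  tail -n +"$((ntrain + 1))" "$outdir/shuffled.lst" | head -n "$ndev" > "$outdir/dev.lst"
  tail -n +"$((ntrain + ndev + 1))" "$outdir/shuffled.lst" > "$outdir/test.lst"
}
```

Shuffling before splitting matters: clips from one recording are adjacent in `clips.lst`, and a straight cut would put entire speakers or recordings into only one split.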
Use the `wplay` tool to debug the output:

```shell
./wplay output/clips.lst # print the transcript and play each clip
```