Automatic pipeline to prepare a directory full of (audio clip : transcript) file pairs for wav2letter training. Currently uses DSAlign for transcript alignment.
This project is part of Talon Research. If you find this useful, please donate.
This process works best on a Mac or Linux computer.
# Debian/Ubuntu (apt)
sudo apt install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev \
    python3 python3-pip python3-venv ffmpeg wget sox
./setup
# macOS (Homebrew)
brew install python3 ffmpeg wget cmake boost sox
./setup
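To confirm that the command-line tools installed above are actually on your PATH, something like the following can be used. This is a minimal sketch: `missing_tools` is a hypothetical helper written for illustration and is not part of this project.

```shell
# Hypothetical helper, not part of this project: print each named command
# that cannot be found on PATH, one per line. Prints nothing if all exist.
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to ask whether a command resolves.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools python3 ffmpeg wget cmake sox` prints nothing when every tool named in the install lines is available.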
./wav2train input/ output/
# ./wfilter output/clips.lst > output/clips-filt.lst # not yet implemented
./wsplit output/clips.lst
- Consumes a directory with audio and matching transcripts, such as:
    input/a.wav
    input/a.txt
    input/b.wav
    input/b.txt
  Most common audio formats (wav, flac, mp3, ogg, sph, etc) will be detected. You can mix formats in the input directory. The audio files can be any length. The only requirement is that the text file is a transcription of the audio file.
- Finds voice activity in the audio files and time-aligns these segments to the transcription.
- Extracts the voice segments into .flac files and creates a wav2letter-compatible clips.lst file.
- The output at this point looks like:
    output/clips/a.flac
    output/clips/b.flac
    output/clips.lst
- [Optional] Use the wfilter tool to filter out "bad inputs" using a pretrained model and an error threshold.
    ./wfilter output/clips.lst > output/filter.lst
- [Optional] Use the wsplit tool to auto-split a clips.lst file into dev.lst, test.lst, train.lst.
    ./wsplit output/clips.lst
    # or, if you filtered:
    ./wsplit output/filter.lst
- [Optional] Use the wpiece tool to generate word piece tokens + lexicon.
    # generates example.lexicon, example.tokens
    ./wpiece example --list output/clips.lst
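Since the first step expects every audio file to have a matching transcript, it can be useful to check the input directory before starting a long run. The following is a minimal sketch, not part of this project: `missing_transcripts` is a hypothetical helper, it assumes transcripts use a .txt extension as in the a.wav / a.txt example, and it only looks at the audio extensions named above (the pipeline itself may detect more).

```shell
# Hypothetical helper, not part of this project: list audio files in a
# directory that have no .txt transcript with the same base name.
missing_transcripts() {
    dir=$1
    for audio in "$dir"/*; do
        [ -f "$audio" ] || continue
        name=${audio##*/}            # file name without the directory
        case "$name" in
            # Assumption: only the extensions named in the text are checked.
            *.wav|*.flac|*.mp3|*.ogg|*.sph) ;;
            *) continue ;;
        esac
        stem=${name%.*}              # file name without its extension
        [ -f "$dir/$stem.txt" ] || printf '%s\n' "$audio"
    done
}
```

Running `missing_transcripts input/` prints one path per audio file that lacks a transcript, and nothing if every file is paired.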
# Print the transcript for each clip and play it, for debugging
./wplay output/clips.lst
# Update the paths in output/*.lst to match the directory's current location.
# As *.lst uses absolute paths, this is useful to run after moving
# datasets around on your disk or to a new machine.
# Only works if the clips are in the clips/ directory next to the .lst file
# (that is, dirname(.lst)/clips/*).
./wrebase output/
# Print some basic stats about a dataset, such as number of clips and total hours.
./wstat output/clips.lst
# Generate word piece vocab and lexicon from one or more lst files.
./wpiece name --list output/clips.lst
# Filter a dataset using wav2letter by emission TER (<50% in this example)
./wfilter w2l-align/ output/clips.lst 0.5 > output/filter.lst
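To picture the kind of path rewrite that the wrebase comments above describe, here is a minimal sketch. It is not the project's wrebase: `rebase_lst` is a hypothetical function, it prints to stdout rather than editing files in place, and it assumes each .lst line is space-separated with the audio path in the second field (the usual wav2letter list layout of id, path, duration, transcript).

```shell
# Hypothetical sketch, not the project's wrebase: point the audio path of
# every line in a .lst file at <new_dir>/clips/<clip file name>.
# Assumption: the audio path is the second whitespace-separated field.
rebase_lst() {
    lst=$1
    new_dir=$2
    awk -v dir="$new_dir" '
        NF < 2 { print; next }             # leave blank or short lines alone
        {
            n = split($2, parts, "/")      # parts[n] is the clip file name
            $2 = dir "/clips/" parts[n]    # assigning a field rejoins the
            print                          # line with single spaces
        }' "$lst"
}
```

For example, `rebase_lst output/clips.lst /new/location/output` prints the list with every clip path pointing into /new/location/output/clips/. Pass an absolute directory, since the .lst files use absolute paths.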