Code for checking the quality of data for speech-to-text (STT) and text-to-speech (TTS).
$ git clone https://github.com/coqui-ai/data-checker.git
$ cd data-checker
$ docker build . -t data-checker
$ docker run data-checker python data_checks.py "/code/data/smoke_test/russian_sample_data/ru.csv" 2
.
.
.
Found 1 <transcript,clip> pairs in /code/data/smoke_test/russian_sample_data/ru.csv
· First audio file found: ru.wav of type audio/wav
· Checking if audio is readable...
Found no unreadable audiofiles
· Reading audio duration...
Found a total of 0.00 hours of readable data
· Get transcript length...
· Get num feature vectors...
Found no audio clips over 30 seconds in length
Found no transcripts under 10 characters in length
· Get ratio (num_feats / transcript_len)...
Found no offending <transcript,clip> pairs
· Calculating ratio (num_feats : transcript_len)...
Found no <transcript,clip> pairs more than 2.0 standard deviations from the mean
┬ Saved a total of 0.00 hours of data to BEST dataset
├ Removed a total of 0.00 hours (0.00% of original data)
├ Removed a total of 0 samples (0.00% of original data)
└ Wrote best data to /code/data/smoke_test/russian_sample_data/ru.BEST
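The last check in the log above flags <transcript,clip> pairs whose ratio of feature vectors to transcript length lies more than 2.0 standard deviations from the mean. The following is a minimal sketch of that kind of filter, not data-checker's actual implementation. The function name filter_by_ratio, the use of the population standard deviation, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def filter_by_ratio(pairs, max_std=2.0):
    """Split (num_feats, transcript_len) pairs into kept and removed lists.

    A pair is removed when its num_feats / transcript_len ratio lies more
    than max_std standard deviations from the mean ratio.
    Assumes every transcript_len is greater than zero.
    """
    ratios = [num_feats / transcript_len for num_feats, transcript_len in pairs]
    mu = mean(ratios)
    sigma = pstdev(ratios)  # population standard deviation (an assumption)

    # With zero spread there are no outliers to remove.
    if sigma == 0:
        return list(pairs), []

    kept, removed = [], []
    for pair, ratio in zip(pairs, ratios):
        if abs(ratio - mu) > max_std * sigma:
            removed.append(pair)
        else:
            kept.append(pair)
    return kept, removed


# Hypothetical data: ten typical clips and one with far too many
# feature vectors for the length of its transcript.
pairs = [(500, 100)] * 10 + [(5000, 100)]
kept, removed = filter_by_ratio(pairs, max_std=2.0)
print(f"kept {len(kept)} pairs, removed {len(removed)} pairs")
```

A pair with an unusual ratio often points to a mismatch between clip and transcript, such as a truncated transcript or long stretches of silence, which is why a filter of this kind is useful before training.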
data-checker assumes your CSV has two columns: wav_filename and transcript. Note that you don't actually need to use WAV files, but the header should still be wav_filename.
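As a rough illustration of the expected layout, the sketch below writes a two-column CSV with the wav_filename and transcript headers using Python's standard csv module. The helper name write_manifest, the clip paths, and the transcripts are hypothetical. The comma delimiter and the way paths are resolved relative to the CSV are assumptions, not something this README specifies.

```python
import csv
import io


def write_manifest(rows, out):
    """Write (audio path, transcript) pairs as a two-column CSV.

    The header names match what data-checker expects; the comma
    delimiter is an assumption.
    """
    writer = csv.writer(out)
    writer.writerow(["wav_filename", "transcript"])
    writer.writerows(rows)


# Hypothetical clips; the header stays wav_filename even for non-WAV audio.
rows = [
    ("clips/sample-0001.wav", "привет мир"),
    ("clips/sample-0002.mp3", "hello, world"),
]

buffer = io.StringIO()
write_manifest(rows, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```

To write to disk instead, pass a file opened with open("my-data.csv", "w", newline="", encoding="utf-8") in place of the StringIO buffer.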
$ docker run --mount "type=bind,src=/path/to/my/local/data,dst=/mnt" data-checker python data_checks.py "/mnt/my-data.csv" 2