A tool that transforms audio into mel-spectrograms for speech datasets.
You can use it to prepare training data for acoustic models (e.g. Tacotron 2) and vocoders (e.g. MelGAN, HiFi-GAN).
- Download or prepare your speech datasets.
- Transform the audio into mel-spectrograms (saved as NumPy files):

  python preprocess.py --dataset=DataBaker --indir=path/BZNSYP --outdir=./training_data
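The command above drives the repo's own extraction code. As a rough, hypothetical illustration of what the audio-to-mel step computes, here is a minimal sketch using librosa; the parameter values and file names are assumptions, not the repo's actual implementation:

```python
# Hypothetical sketch of the audio -> mel step, not the repo's actual code.
import numpy as np
import librosa

def audio_to_mel(wav_path, sample_rate=22050, n_fft=1024,
                 hop_length=256, win_length=1024, n_mels=80):
    """Load a wav file and return a log-mel-spectrogram as a float32 NumPy array."""
    wav, _ = librosa.load(wav_path, sr=sample_rate)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sample_rate, n_fft=n_fft,
        hop_length=hop_length, win_length=win_length, n_mels=n_mels)
    # Convert power to dB so values are on a log scale, as most TTS recipes expect.
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.astype(np.float32)

# Example: np.save("./training_data/mels/mel-000001.npy", audio_to_mel("000001.wav"))
```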
- To support more dataset types, you can add a processing script under "./datasets/"; refer to "ljspeech.py" and "databaker.py" as examples (contributions are welcome). A hypothetical skeleton is sketched below.
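A dataset script typically walks the corpus, extracts features per utterance, and returns metadata rows. The skeleton below is only a hedged sketch under that assumption; mirror "ljspeech.py" / "databaker.py" from the repo for the real interface (the function names and fields here are illustrative, not the repo's API):

```python
# Hypothetical skeleton for a new script under ./datasets/ (names are illustrative).
import os
from concurrent.futures import ProcessPoolExecutor

def build_from_path(indir, outdir, num_workers=1):
    """Walk the dataset, process each utterance in parallel, and collect metadata rows."""
    executor = ProcessPoolExecutor(max_workers=num_workers)
    futures = []
    wav_dir = os.path.join(indir, "wavs")  # adjust to the dataset's actual layout
    for fname in sorted(os.listdir(wav_dir)):
        if fname.endswith(".wav"):
            wav_path = os.path.join(wav_dir, fname)
            text = ""  # look up the transcript for this utterance here
            futures.append(executor.submit(_process_utterance, outdir, wav_path, text))
    return [f.result() for f in futures]

def _process_utterance(outdir, wav_path, text):
    """Extract audio/mel/linear features, save them as .npy files, return one metadata row."""
    ...
```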
- After processing, the output directory is organized as follows (the metadata format can be modified in preprocess.py, in def write_metadata; a hedged illustration follows the tree):

  outdir/
  |--train.txt
  |--audio/
  |--mels/
  |--linear/
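The exact columns of train.txt are defined by write_metadata in preprocess.py. As a hedged illustration of the kind of pipe-separated metadata such preprocessors often write (the fields shown are assumptions, not necessarily this repo's format):

```python
# Hypothetical illustration; the real format is whatever write_metadata() in
# preprocess.py emits, and the fields below are assumed.
import os

def write_metadata(metadata, outdir):
    """Write one pipe-separated line per utterance, e.g.
    audio-000001.npy|mel-000001.npy|linear-000001.npy|<n_frames>|<text>"""
    with open(os.path.join(outdir, "train.txt"), "w", encoding="utf-8") as f:
        for row in metadata:
            f.write("|".join(str(x) for x in row) + "\n")
```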
- "MultiSets" is used for multi-speaker or multilingual datasets.
- "config.json" is used to extract mel-spectrograms under different acoustic parameters; 16 kHz and 22 kHz configurations are provided as references (e.g. "./datasets/config16k.json"). A sketch of typical parameters is shown below.
- Linear spectrograms require a lot of memory; if you do not need them, you can delete the "linear/" directory from the output directory.
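If the directory was already written, it can simply be removed afterwards (the path below assumes the output layout shown above):

```python
# Remove the saved linear spectrograms if they are not needed (path is assumed).
import shutil
shutil.rmtree("./training_data/linear", ignore_errors=True)
```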
- Currently supported datasets:
  - DataBaker (BZNSYP): https://www.data-baker.com/#/data/index/source
  - AIShell-3: https://www.openslr.org/resources/93/data_aishell3.tgz
  - LJSpeech: https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2