Grad-TTS model based on Diffusion Probabilistic Modelling. For all details check out our paper accepted to ICML 2021 via this link.
Demo page with voiced abstract: link.
Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions, while stochastic calculus has provided a unified point of view on these techniques, allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with a score-based decoder producing mel-spectrograms by gradually transforming noise predicted by the encoder and aligned with the text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters, and allows us to make this reconstruction flexible by explicitly controlling the trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score.
Firstly, install all Python package requirements:
    pip install -r requirements.txt

Secondly, build monotonic_align code (Cython):

    cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..

Note: code is tested on Python==3.8
Requirements for low-bit-optimizers: Python >= 3.7, CUDA >= 11.0, torch >= 1.13.0.
To install, run:

    git clone https://github.com/thu-ml/low-bit-optimizers.git
    cd low-bit-optimizers
    pip install -v -e .

- Fill "text_cleaners" in params.py
- Edit text/symbols.py
- Remove unnecessary imports from text/cleaners.py
You can download HiFi-GAN checkpoint trained on LJSpeech* and Libri-TTS datasets (22kHz) from here.
Put necessary Grad-TTS and HiFi-GAN checkpoints into checkpts folder in root Grad-TTS directory (note: in inference.py you can change default HiFi-GAN path).
- Create text file with sentences you want to synthesize like test.txt.
- For single speaker set params.n_spks=1 and for multispeaker (Libri-TTS) inference set params.n_spks=247.
- Run script inference.py by providing path to the text file, path to the Grad-TTS checkpoint, number of iterations to be used for reverse diffusion (default: 10) and speaker id if you want to perform multispeaker inference:

      python inference.py -f <your-text-file> -c <grad-tts-checkpoint> -t <number-of-timesteps> -s <speaker-id-if-multispeaker>

- Check out folder called outputs for generated audios.
You can also perform interactive inference by running Jupyter Notebook inference.ipynb.
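The inference call described above can also be assembled from Python, for example when synthesizing several files in a loop. The following is a minimal sketch that only uses the flags listed above (`-f`, `-c`, `-t`, `-s`); the helper name `build_inference_command` and the checkpoint path `checkpts/grad-tts.pt` are illustrative assumptions, not part of the repository.

```python
import shlex
from typing import List, Optional


def build_inference_command(text_file: str,
                            checkpoint: str,
                            timesteps: int = 10,
                            speaker_id: Optional[int] = None) -> List[str]:
    """Build the argument list for inference.py (hypothetical helper).

    Only the flags documented above are used: -f, -c, -t and, for
    multispeaker checkpoints, -s.
    """
    if timesteps < 1:
        raise ValueError("timesteps must be a positive integer")
    cmd = ["python", "inference.py",
           "-f", text_file,
           "-c", checkpoint,
           "-t", str(timesteps)]
    if speaker_id is not None:
        # Speaker id is only passed for multispeaker inference.
        cmd += ["-s", str(speaker_id)]
    return cmd


# Assumed checkpoint path; replace it with the file you placed in checkpts.
command = build_inference_command("test.txt", "checkpts/grad-tts.pt")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the root Grad-TTS directory. Raising the `timesteps` argument trades inference speed for sound quality, as described in the abstract.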
- Make filelists of your audio data like ones included into resources/filelists folder. For single speaker training refer to ljspeech filelists and to libri-tts filelists for multispeaker.
- Set experiment configuration in params.py file.
- Specify your GPU device and run training script:

      export CUDA_VISIBLE_DEVICES=YOUR_GPU_ID
      python train.py  # if single speaker
      python train_multi_speaker.py  # if multispeaker

- To track your training process run tensorboard server on any available port:

      tensorboard --logdir=YOUR_LOG_DIR

During training all logging information and checkpoints are stored in YOUR_LOG_DIR, which you can specify in params.py before training.
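As an illustration of the filelist step, the sketch below formats filelist lines for custom data. It assumes a pipe-separated layout (`path|text` for single speaker, `path|text|speaker_id` for multispeaker); this layout is an assumption here, so compare it with the files in resources/filelists before using it. The function name and the example paths are hypothetical.

```python
from typing import Optional


def format_filelist_line(wav_path: str,
                         text: str,
                         speaker_id: Optional[int] = None) -> str:
    """Format one filelist entry (assumed layout: path|text[|speaker_id])."""
    text = text.strip()
    if "|" in wav_path or "|" in text:
        # The separator must not appear inside a field.
        raise ValueError("fields must not contain the '|' separator")
    fields = [wav_path, text]
    if speaker_id is not None:
        fields.append(str(speaker_id))
    return "|".join(fields)


# Hypothetical entries for a single speaker and a multispeaker dataset.
single = format_filelist_line("data/wavs/utt_0001.wav", "Hello world.")
multi = format_filelist_line("data/wavs/utt_0002.wav", "Good morning.", speaker_id=12)
print(single)
print(multi)
```

Writing one such line per utterance to a text file gives a filelist that can then be referenced from params.py.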
- HiFi-GAN model is used as vocoder, official github repository: link.
- Monotonic Alignment Search algorithm is used for unsupervised duration modelling, official github repository: link.
- text/cleaners.py: ORI-Muchim/PolyLangVITS
- low-bit-optimizers
- Model quantization (float32 -> int8), inference only
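To make the float32 -> int8 idea concrete, here is a minimal pure-Python sketch of symmetric per-tensor quantization. It is a generic illustration only: how this repository actually quantizes the model is not described here, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to scale 1.0 for an empty or all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.25, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# Rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per value. This kind of loss is usually acceptable at inference time, but it is not suitable for gradient updates, which is consistent with the inference-only note above.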
