RADTTS

Provides training, inference, and voice conversion recipes for RADTTS and RADTTS++: Flow-based TTS models with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control of Low-Dimensional (F0 and Energy) Speech Attributes.

Flow-based TTS with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control of Low-Dimensional (F0 and Energy) Speech Attributes.

This repository contains the source code and several checkpoints for our work based on RADTTS. RADTTS is a normalizing-flow-based TTS framework with state-of-the-art acoustic fidelity and a highly robust audio-transcription alignment module. Our project page and some samples can be found here, with relevant works listed here.

This repository can be used to train the following models:

  • A normalizing-flow bipartite architecture for mapping text to mel spectrograms
  • A variant of the above, conditioned on F0 and Energy
  • Normalizing flow models for explicitly modeling text-conditional phoneme duration, fundamental frequency (F0), and energy
  • A standalone alignment module for learning the unsupervised text-audio alignments necessary for TTS training
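
To make the "bipartite" flow idea in the list above concrete, here is a minimal sketch of a single affine coupling step, the usual building block of bipartite normalizing flows. It is illustrative only and is not taken from this repository: `toy_scale_shift` is a hypothetical stand-in for the neural network that, in a TTS model, would also receive text conditioning.

```python
import math

def coupling_forward(x, scale_shift_fn):
    """Keep the first half of x; affinely transform the second half given the first."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = scale_shift_fn(x_a)
    z_b = [xb * math.exp(ls) + tt for xb, ls, tt in zip(x_b, log_s, t)]
    # The log-determinant of the Jacobian is the sum of the log-scales.
    log_det = sum(log_s)
    return x_a + z_b, log_det

def coupling_inverse(z, scale_shift_fn):
    """Exact inverse: the untouched half lets us recompute the same scale and shift."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = scale_shift_fn(z_a)
    x_b = [(zb - tt) * math.exp(-ls) for zb, ls, tt in zip(z_b, log_s, t)]
    return z_a + x_b

def toy_scale_shift(x_a):
    # Hypothetical placeholder for a learned network.
    log_s = [0.1 * v for v in x_a]
    t = [v + 1.0 for v in x_a]
    return log_s, t

x = [0.5, -1.0, 2.0, 3.0]
z, log_det = coupling_forward(x, toy_scale_shift)
x_rec = coupling_inverse(z, toy_scale_shift)
```

Because the first half passes through unchanged, the transform is invertible no matter how complex the conditioning network is. That is what lets a flow be trained by maximum likelihood and then run in reverse for synthesis.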

HiFi-GAN vocoder pre-trained models

We provide a checkpoint and config for a HiFi-GAN vocoder trained on LibriTTS 100 and 360.
For a HiFi-GAN vocoder trained on LJS, please download the v1 model provided by the HiFi-GAN authors here.

RADTTS pre-trained models

| Model name | Description | Dataset |
|---|---|---|
| RADTTS++DAP-LJS | RADTTS model conditioned on F0 and Energy with deterministic attribute predictors | LJSpeech Dataset |

We will soon provide more pre-trained RADTTS models with generative attribute predictors trained on LJS and LibriTTS. Stay tuned!

Setup

  1. Clone this repo: git clone https://github.com/NVIDIA/RADTTS.git
  2. Install python requirements or build docker image
    • Install python requirements: pip install -r requirements.txt
  3. Update the filelists inside the filelists folder and json configs to point to your data
    • basedir – the folder containing the filelists and the audiodir
    • audiodir – name of the audiodir
    • filelist – | (pipe) separated text file with relative audiopath, text, speaker, and optionally a categorical label and the audio duration in seconds
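
As a rough illustration of the filelist layout described above, the sketch below parses one pipe-separated line and resolves the audio path. The function names are hypothetical, and joining `basedir`, `audiodir`, and the relative audiopath in that order is an assumption; check the repository's data loader for the exact behavior.

```python
import os

def parse_filelist_line(line):
    """Split one filelist line into its fields; label and duration are optional."""
    fields = line.rstrip("\n").split("|")
    if len(fields) < 3:
        raise ValueError("expected at least audiopath|text|speaker")
    entry = {"audiopath": fields[0], "text": fields[1], "speaker": fields[2]}
    if len(fields) > 3:
        entry["label"] = fields[3]
    if len(fields) > 4:
        entry["duration"] = float(fields[4])
    return entry

# Dataset entry shaped like the JSON config keys named above (values are examples).
dataset = {
    "basedir": "data/",
    "audiodir": "22khz",
    "filelist": "my_filelist.txt",  # hypothetical filename
}

def resolve_audio_path(dataset, entry):
    # Assumption: audiopath is relative to <basedir>/<audiodir>.
    return os.path.join(dataset["basedir"], dataset["audiodir"], entry["audiopath"])

entry = parse_filelist_line("wavs/a.wav|Hello world.|ljs|neutral|1.5\n")
audio_path = resolve_audio_path(dataset, entry)
```

Note that a plain split on `|` breaks if the transcript itself contains a pipe character, so such characters would need to be removed from the text beforehand.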

Training RADTTS (without pitch and energy conditioning)

  1. Train the decoder
    python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir
  2. Further train with the duration predictor
    python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir_dir train_config.warmstart_checkpoint_path=model_path.pt model_config.include_modules="decatndur"

Training RADTTS++ (with pitch and energy conditioning)

  1. Train the decoder
    python train.py -c config_ljs_decoder.json -p train_config.output_directory=outdir
  2. Train the attribute predictor: autoregressive flow (agap), bi-partite flow (bgap) or deterministic (dap)
    python train.py -c config_ljs_{agap,bgap,dap}.json -p train_config.output_directory=outdir_wattr train_config.warmstart_checkpoint_path=model_path.pt
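
The training commands above pass `-p section.key=value` pairs to override entries of the JSON config. The following is a minimal sketch of how such dotted overrides could be applied to a nested dictionary. It is not the repository's own parser, and `apply_override` is a hypothetical helper.

```python
import ast

def apply_override(config, override):
    """Apply one 'a.b.c=value' override to a nested dict, in place."""
    dotted_key, raw_value = override.split("=", 1)
    try:
        # Interpret numbers, lists, booleans, etc. when the text is a Python literal.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        # Otherwise keep it as a plain string (e.g. a path or module name).
        value = raw_value
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return config

config = {"train_config": {"output_directory": None}, "model_config": {}}
for item in [
    "train_config.output_directory=outdir",
    "train_config.warmstart_checkpoint_path=model_path.pt",
    'train_config.ignore_layers_warmstart=["speaker_embedding.weight"]',
]:
    apply_override(config, item)
```

Splitting on the first `=` only keeps values that themselves contain `=` intact.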

Training starting from a pre-trained model, ignoring the speaker embedding table

  1. Download our pre-trained model
  2. python train.py -c config.json -p train_config.ignore_layers_warmstart=["speaker_embedding.weight"] train_config.warmstart_checkpoint_path=model_path.pt
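
Conceptually, warm-starting while ignoring a layer means copying every checkpoint parameter into the new model except the ignored ones, which keep their fresh initialization. The sketch below shows that idea with plain dictionaries standing in for PyTorch state dicts. `filter_warmstart_state` is a hypothetical helper, not the repository's implementation.

```python
def filter_warmstart_state(checkpoint_state, model_state, ignore_layers):
    """Merge checkpoint parameters into the model state, skipping ignored names."""
    merged = dict(model_state)
    for name, value in checkpoint_state.items():
        if name in ignore_layers:
            continue  # keep the model's own (freshly initialized) parameter
        if name in merged:
            merged[name] = value
    return merged

# Toy stand-ins for state dicts; real entries would be tensors.
checkpoint_state = {"speaker_embedding.weight": "old_table", "decoder.weight": "trained"}
model_state = {"speaker_embedding.weight": "new_table", "decoder.weight": "random"}

merged = filter_warmstart_state(
    checkpoint_state, model_state, ignore_layers=["speaker_embedding.weight"]
)
```

This is useful when the new dataset has a different set of speakers, so the shape of the pre-trained speaker embedding table no longer matches.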

Multi-GPU (distributed)

  1. python -m torch.distributed.launch --use_env --nproc_per_node=NUM_GPUS_YOU_HAVE train.py -c config.json -p train_config.output_directory=outdir
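
With `--use_env`, the launcher passes each process its local rank through the `LOCAL_RANK` environment variable rather than a `--local_rank` argument, alongside `RANK` and `WORLD_SIZE`. A minimal sketch of reading those values follows. `read_distributed_env` is a hypothetical helper, and the single-process defaults are an assumption made for convenience.

```python
import os

def read_distributed_env(environ=None):
    """Read the rank information the launcher exports, defaulting to one process."""
    environ = os.environ if environ is None else environ
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }

# Example with the values process 1 of 4 on a single node would see.
info = read_distributed_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})
```

The local rank is typically used to select which GPU the process should use.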

Inference demo

  1. python inference.py -c CONFIG_PATH -r RADTTS_PATH -v HG_PATH -k HG_CONFIG_PATH -t TEXT_PATH -s ljs --speaker_attributes ljs --speaker_text ljs -o results/

Inference Voice Conversion demo

  1. python inference_voice_conversion.py --radtts_path RADTTS_PATH --radtts_config_path RADTTS_CONFIG_PATH --vocoder_path HG_PATH --vocoder_config_path HG_CONFIG_PATH --f0_mean=211.413 --f0_std=46.6595 --energy_mean=0.724884 --energy_std=0.0564605 --output_dir=results/ -p data_config.validation_files="{'Dummy': {'basedir': 'data/', 'audiodir':'22khz', 'filelist': 'vc_audiopath_txt_speaker_emotion_duration_filelist.txt'}}"
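
This README does not spell out how the `--f0_mean` and `--f0_std` values are applied. One common use of such statistics in voice conversion is to rescale the source pitch contour so that its voiced frames match the target speaker's mean and standard deviation. The sketch below illustrates that idea only. It assumes unvoiced frames are encoded as 0, and `shift_f0_statistics` is a hypothetical helper.

```python
def shift_f0_statistics(f0, target_mean, target_std):
    """Rescale voiced F0 values to a target mean/std; leave unvoiced (0) frames alone."""
    voiced = [v for v in f0 if v > 0]
    if len(voiced) < 2:
        return list(f0)
    mean = sum(voiced) / len(voiced)
    std = (sum((v - mean) ** 2 for v in voiced) / len(voiced)) ** 0.5
    if std == 0:
        return [target_mean if v > 0 else 0.0 for v in f0]
    return [
        ((v - mean) / std) * target_std + target_mean if v > 0 else 0.0
        for v in f0
    ]

source_f0 = [0.0, 100.0, 120.0, 0.0, 140.0]  # Hz, toy contour
converted_f0 = shift_f0_statistics(source_f0, target_mean=211.413, target_std=46.6595)
```

The same normalization could be applied to energy using `--energy_mean` and `--energy_std`, again as an assumption about their intended use.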

Config Files

| Filename | Description | Nota bene |
|---|---|---|
| config_ljs_decoder.json | Config for the decoder conditioned on F0 and Energy | |
| config_ljs_radtts.json | Config for the decoder not conditioned on F0 and Energy | |
| config_ljs_agap.json | Config for the Autoregressive Flow Attribute Predictors | Requires at least pre-trained alignment module |
| config_ljs_bgap.json | Config for the Bi-Partite Flow Attribute Predictors | Requires at least pre-trained alignment module |
| config_ljs_dap.json | Config for the Deterministic Attribute Predictors | Requires at least pre-trained alignment module |

LICENSE

Unless otherwise specified, the source code within this repository is provided under the MIT License.

Acknowledgements

The code in this repository is heavily inspired by or makes use of source code from the following works:

Relevant Papers

Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro.
One TTS Alignment to Rule Them All. ICASSP 2022

Kevin J. Shih, Rafael Valle, Rohan Badlani, Adrian Łańcucki, Wei Ping, Bryan Catanzaro.
RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis.
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models 2021

Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro.
Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows. Technical Report