PyTorch implementation of Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
This implementation includes distributed and automatic mixed precision support and uses the RUSLAN dataset.
Distributed and Automatic Mixed Precision support relies on NVIDIA's Apex and AMP.
Audio samples: https://soundcloud.com/andrey-nikishaev/sets/russian-tts-nvidia-tacotron2
- Added diagonal guided attention (DGA), borrowed from another model (https://arxiv.org/abs/1710.08969)
- Added Maximizing Mutual Information for Tacotron (MMI) (https://arxiv.org/abs/1909.01145), but could not make it work as shown in the paper; DGA still gives better and much cleaner results
- Added Russian text preparation with a simple stress dictionary for homographs (e.g. za'mok "castle" vs. zamo'k "lock")
- Uses HiFi-GAN as the vocoder
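The stress dictionary mentioned above can be illustrated with a minimal sketch. The dictionary contents and function below are illustrative assumptions, not the repo's actual code:

```python
# Toy stress dictionary in the spirit of the Russian text preparation step.
# Entries and the apostrophe stress mark are illustrative assumptions.
STRESS_DICT = {
    "замок": ["за'мок", "замо'к"],  # homograph: castle vs. lock
}

def apply_stress(word, sense=0):
    """Return the stressed form of a word if it is in the dictionary;
    unknown words pass through unchanged."""
    forms = STRESS_DICT.get(word.lower())
    if forms is None:
        return word
    return forms[min(sense, len(forms) - 1)]
```

In a real front end the sense index would come from context disambiguation; a simple dictionary can only pick a default form for each homograph.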
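Diagonal guided attention penalizes alignments far from the text/audio diagonal. A minimal NumPy sketch of the guided attention loss from the cited paper (function names and the default sharpness `g` are assumptions, not this repo's code):

```python
import numpy as np

def guided_attention_weights(n_mel_steps, n_text_steps, g=0.2):
    """Penalty matrix W[t, n] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)):
    near zero on the diagonal, close to one far away from it."""
    t = np.arange(n_mel_steps)[:, None] / n_mel_steps    # decoder step, normalized
    n = np.arange(n_text_steps)[None, :] / n_text_steps  # encoder step, normalized
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g * g))

def guided_attention_loss(attention, g=0.2):
    """attention: (n_mel_steps, n_text_steps) soft alignment matrix.
    Off-diagonal attention mass is penalized; diagonal mass is nearly free."""
    W = guided_attention_weights(*attention.shape, g=g)
    return float(np.mean(attention * W))
```

A perfectly diagonal alignment incurs (almost) no loss, while a scrambled one is penalized, which is what pushes training toward clean monotonic attention.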
- NVIDIA GPU + CUDA + cuDNN
- Download and extract the RUSLAN dataset
- Clone this repo:
git clone https://github.com/NVIDIA/tacotron2.git
- cd into this repo:
cd tacotron2
- Install PyTorch 1.0
- Install Apex
- Install python requirements or build docker image
- Install python requirements:
pip install -r requirements.txt
- Train the model:
python train.py --output_directory=outdir --log_directory=logdir
- (OPTIONAL) Monitor training with Tensorboard:
tensorboard --logdir=outdir/logdir
Training using a pre-trained model can lead to faster convergence. By default, the dataset-dependent text embedding layers are ignored when warm-starting.
- Download our published Ruslan Model or LJ Speech model
python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
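The warm-start behavior described above (copy checkpoint weights, but skip dataset-dependent layers such as the text embedding) can be sketched as follows. The function and the `ignore_layers` default are illustrative assumptions, not the repo's exact code:

```python
# Sketch of warm-start weight loading: take weights from a checkpoint,
# but keep the new model's freshly initialized weights for ignored layers.
# The key name "embedding.weight" is an illustrative assumption.
def warm_start_state_dict(checkpoint_state, model_state,
                          ignore_layers=("embedding.weight",)):
    """Return a merged state dict: checkpoint weights where allowed,
    the new model's parameters for ignored or missing keys."""
    merged = dict(model_state)  # start from the new model's parameters
    for key, value in checkpoint_state.items():
        if key in ignore_layers or key not in model_state:
            continue  # dataset-dependent layers stay freshly initialized
        merged[key] = value
    return merged
```

The merged dict would then be loaded into the model; this is why a checkpoint trained on a different character set (e.g. LJ Speech) can still warm-start a Russian model.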
- Multi-GPU (distributed) and automatic mixed precision training:
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
- Download our published Ruslan Model or LJ Speech model
- Download published HiFi-GAN Model (Universal model recommended for non-English languages)
jupyter notebook --ip=127.0.0.1 --port=31337
- Load inference.ipynb
N.b. When performing mel-spectrogram-to-audio synthesis, make sure Tacotron 2 and the vocoder (here HiFi-GAN) were trained on the same mel-spectrogram representation.
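One way to follow the advice above is to compare the mel-related settings of the two configs before vocoding. This is an illustrative helper; the parameter names follow common Tacotron 2 / HiFi-GAN hparams but are assumptions here:

```python
# Illustrative check that the acoustic model and vocoder agree on the
# mel-spectrogram representation. Key names are assumptions.
MEL_KEYS = ("sampling_rate", "n_mel_channels", "filter_length",
            "hop_length", "win_length", "mel_fmin", "mel_fmax")

def check_mel_compat(taco_cfg, vocoder_cfg, keys=MEL_KEYS):
    """Return a list of (key, tacotron_value, vocoder_value) mismatches;
    an empty list means the representations match on the checked keys."""
    return [(k, taco_cfg.get(k), vocoder_cfg.get(k))
            for k in keys
            if taco_cfg.get(k) != vocoder_cfg.get(k)]
```

A mismatch in any of these (most often `hop_length` or `mel_fmax`) typically produces noisy or pitch-shifted audio rather than an outright error, so an explicit check is cheap insurance.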
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
This implementation uses code from the following repos: NVIDIA/tacotron2