PyTorch implementation of Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
This implementation includes distributed and automatic mixed precision support and uses the RUSLAN dataset.
Distributed and Automatic Mixed Precision support relies on NVIDIA's Apex and AMP.
Audio samples: https://soundcloud.com/andrey-nikishaev/sets/russian-tts-nvidia-tacotron2
- Added diagonal guided attention (DGA), borrowed from another model (https://arxiv.org/abs/1710.08969)
- Added Maximizing Mutual Information for Tacotron (MMI) (https://arxiv.org/abs/1909.01145), but could not make it work as shown in the paper; DGA still gives better and much cleaner results
- Added Russian text preparation with a simple stress dictionary for homographs (e.g. za'mok "castle" vs. zamo'k "lock")
- Uses HiFi-GAN as the vocoder
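The stress dictionary mentioned above can be illustrated with a minimal sketch. The dictionary contents and function below are illustrative assumptions, not the repo's actual code:

```python
# Toy stress dictionary in the spirit of the Russian text preparation step.
# Entries and the apostrophe stress mark are illustrative assumptions.
STRESS_DICT = {
    "замок": ["за'мок", "замо'к"],  # homograph: castle vs. lock
}

def apply_stress(word, sense=0):
    """Return the stressed form of a word if it is in the dictionary;
    unknown words pass through unchanged."""
    forms = STRESS_DICT.get(word.lower())
    if forms is None:
        return word
    return forms[min(sense, len(forms) - 1)]
```

In a real front end the sense index would come from context disambiguation; a simple dictionary can only pick a default form for each homograph.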
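Diagonal guided attention penalizes alignments far from the text/audio diagonal. A minimal NumPy sketch of the guided attention loss from the cited paper (function names and the default sharpness `g` are assumptions, not this repo's code):

```python
import numpy as np

def guided_attention_weights(n_mel_steps, n_text_steps, g=0.2):
    """Penalty matrix W[t, n] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)):
    near zero on the diagonal, close to one far away from it."""
    t = np.arange(n_mel_steps)[:, None] / n_mel_steps    # decoder step, normalized
    n = np.arange(n_text_steps)[None, :] / n_text_steps  # encoder step, normalized
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g * g))

def guided_attention_loss(attention, g=0.2):
    """attention: (n_mel_steps, n_text_steps) soft alignment matrix.
    Off-diagonal attention mass is penalized; diagonal mass is nearly free."""
    W = guided_attention_weights(*attention.shape, g=g)
    return float(np.mean(attention * W))
```

A perfectly diagonal alignment incurs (almost) no loss, while a scrambled one is penalized, which is what pushes training toward clean monotonic attention.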
- NVIDIA GPU + CUDA + cuDNN
- Download and extract the RUSLAN dataset
- Clone this repo:
git clone https://github.com/NVIDIA/tacotron2.git
- cd into this repo:
cd tacotron2
- Install PyTorch 1.0
- Install Apex
- Install python requirements or build docker image
- Install python requirements:
pip install -r requirements.txt
- Train the model:
python train.py --output_directory=outdir --log_directory=logdir
- (OPTIONAL) Monitor training with Tensorboard:
tensorboard --logdir=outdir/logdir
Training using a pre-trained model can lead to faster convergence. By default, the dataset-dependent text embedding layers are ignored when warm-starting.
- Download our published Ruslan Model or LJ Speech model
python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
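The warm-start behavior described above (copy checkpoint weights, but skip dataset-dependent layers such as the text embedding) can be sketched as follows. The function and the `ignore_layers` default are illustrative assumptions, not the repo's exact code:

```python
# Sketch of warm-start weight loading: take weights from a checkpoint,
# but keep the new model's freshly initialized weights for ignored layers.
# The key name "embedding.weight" is an illustrative assumption.
def warm_start_state_dict(checkpoint_state, model_state,
                          ignore_layers=("embedding.weight",)):
    """Return a merged state dict: checkpoint weights where allowed,
    the new model's parameters for ignored or missing keys."""
    merged = dict(model_state)  # start from the new model's parameters
    for key, value in checkpoint_state.items():
        if key in ignore_layers or key not in model_state:
            continue  # dataset-dependent layers stay freshly initialized
        merged[key] = value
    return merged
```

The merged dict would then be loaded into the model; this is why a checkpoint trained on a different character set (e.g. LJ Speech) can still warm-start a Russian model.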
- Multi-GPU (distributed) and automatic mixed precision training:
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
- Download our published Ruslan Model or LJ Speech model
- Download published HiFi-GAN Model (Universal model recommended for non-English languages)
jupyter notebook --ip=127.0.0.1 --port=31337
- Load inference.ipynb
N.b. When performing mel-spectrogram-to-audio synthesis, make sure Tacotron 2 and the vocoder (here HiFi-GAN) were trained on the same mel-spectrogram representation.
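One way to follow the advice above is to compare the mel-related settings of the two configs before vocoding. This is an illustrative helper; the parameter names follow common Tacotron 2 / HiFi-GAN hparams but are assumptions here:

```python
# Illustrative check that the acoustic model and vocoder agree on the
# mel-spectrogram representation. Key names are assumptions.
MEL_KEYS = ("sampling_rate", "n_mel_channels", "filter_length",
            "hop_length", "win_length", "mel_fmin", "mel_fmax")

def check_mel_compat(taco_cfg, vocoder_cfg, keys=MEL_KEYS):
    """Return a list of (key, tacotron_value, vocoder_value) mismatches;
    an empty list means the representations match on the checked keys."""
    return [(k, taco_cfg.get(k), vocoder_cfg.get(k))
            for k in keys
            if taco_cfg.get(k) != vocoder_cfg.get(k)]
```

A mismatch in any of these (most often `hop_length` or `mel_fmax`) typically produces noisy or pitch-shifted audio rather than an outright error, so an explicit check is cheap insurance.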
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
This implementation uses code from the following repos: NVIDIA/tacotron2