Rafael Valle*, Jason Li*, Ryan Prenger and Bryan Catanzaro
In our recent paper we propose Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data.
By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice.
Visit our website for audio samples.
Pre-requisites
- NVIDIA GPU + CUDA cuDNN
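Before continuing, it can help to confirm that the GPU driver is visible. The sketch below is one way to do that, assuming the nvidia-smi utility was installed with the NVIDIA driver and is on your PATH. The function name is hypothetical, and the check says nothing about the installed CUDA or cuDNN versions.

```python
import shutil
import subprocess


def nvidia_driver_visible(executable="nvidia-smi"):
    """Return True if the NVIDIA driver utility is on PATH and exits cleanly."""
    path = shutil.which(executable)
    if path is None:
        # Utility not found: driver missing or PATH not set up.
        return False
    try:
        result = subprocess.run([path], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("NVIDIA driver visible:", nvidia_driver_visible())
```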
Setup
- Clone this repo:
git clone https://github.com/NVIDIA/mellotron.git
(On Windows, install Git for Windows first.)
- CD into this repo:
cd mellotron
- Initialize submodule:
git submodule init; git submodule update
- Install Anaconda
- Create and activate a conda environment:
conda create -n mellotron python=3.8
conda activate mellotron
- Install PyTorch:
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
- Install Apex
- Install Python requirements or build the Docker image
  - Install Python requirements:
pip install -r requirements.txt
Training
- Update the filelists inside the filelists folder to point to your data
- Start training:
python train.py --output_directory=outdir --log_directory=logdir
- (Optional) Monitor training with TensorBoard:
tensorboard --logdir=outdir/logdir
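The first step above, pointing the filelists at your data, can be scripted. This is a minimal sketch, assuming each filelist line is pipe-delimited as audio_path|transcript|speaker_id. The function name and its arguments are hypothetical and not part of this repo.

```python
from pathlib import Path


def repoint_filelist(src, dst, old_prefix, new_prefix):
    """Rewrite the audio-path column of a pipe-delimited filelist.

    Assumes each line looks like: audio_path|transcript|speaker_id
    Returns the number of lines whose path was rewritten.
    """
    rewritten = 0
    out_lines = []
    for line in Path(src).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Only the first field is a path; leave the rest untouched.
        audio_path, sep, rest = line.partition("|")
        if audio_path.startswith(old_prefix):
            audio_path = new_prefix + audio_path[len(old_prefix):]
            rewritten += 1
        out_lines.append(audio_path + sep + rest)
    Path(dst).write_text("\n".join(out_lines) + "\n", encoding="utf-8")
    return rewritten
```

For example, repoint_filelist("filelists/train.txt", "filelists/train_local.txt", "DUMMY/", "/data/wavs/") would replace a leading DUMMY/ with /data/wavs/ on every matching line. All four values are placeholders. Writing to a new file keeps the shipped filelist intact for comparison.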
Training using a pre-trained model
Training using a pre-trained model can lead to faster convergence.
By default, the speaker embedding layer is ignored when the pre-trained weights are loaded.
- Download our published Mellotron model trained on LibriTTS or LJS
python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
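To illustrate what ignoring a layer during a warm start means, the sketch below merges checkpoint weights into a model's weights while skipping a named layer. It uses plain dictionaries as stand-ins for PyTorch state dicts. The function name and the layer names are assumptions for illustration, and this is not necessarily how train.py implements it.

```python
def merge_warm_start_weights(checkpoint_state, model_state, ignore_layers):
    """Return the weights a model would load when warm-starting.

    Entries named in ignore_layers keep the model's own (freshly
    initialized) values; everything else comes from the checkpoint.
    """
    kept = {name: value for name, value in checkpoint_state.items()
            if name not in ignore_layers}
    merged = dict(model_state)  # start from the model's current weights
    merged.update(kept)         # overwrite with the checkpoint weights we kept
    return merged


# Toy stand-ins: real state dicts map parameter names to tensors.
checkpoint_state = {"encoder.weight": "pretrained",
                    "speaker_embedding.weight": "pretrained"}
model_state = {"encoder.weight": "random",
               "speaker_embedding.weight": "random"}

merged = merge_warm_start_weights(
    checkpoint_state, model_state, ["speaker_embedding.weight"])
print(merged)
```

Skipping the speaker embedding makes sense when your dataset has a different set of speakers than the pre-trained model, since the embedding table would not match.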
Multi-GPU (distributed) and Automatic Mixed Precision Training
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
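The --hparams flag in the command above takes comma-separated name=value overrides. The sketch below shows how such a string maps onto settings. It is an illustrative parser with hypothetical function names, not the one used by train.py, and it does not handle list-valued overrides that contain commas.

```python
def _convert(raw):
    """Interpret a raw string as bool, int, or float, else leave it a string."""
    if raw in ("True", "False"):
        return raw == "True"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_overrides(spec):
    """Parse 'name=value,name=value' into a dict with basic type inference."""
    overrides = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got {item!r}")
        overrides[name.strip()] = _convert(raw.strip())
    return overrides


overrides = parse_overrides("distributed_run=True,fp16_run=True")
print(overrides)
```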
Inference demo
- Start the Jupyter notebook server:
jupyter notebook --ip=127.0.0.1 --port=31337
- Load inference.ipynb
- (Optional) Download our published WaveGlow model
Related repos
WaveGlow: Faster than real time Flow-based Generative Network for Speech Synthesis.
Acknowledgements
This implementation uses code from the following repos: Keith Ito, Prem Seetharaman, Chengqi Deng, Patrice Guyot, as described in our code.