A TensorFlow implementation of Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
-
Install Python 3.
-
Install TensorFlow for your platform. This code works with TensorFlow 1.3 or 1.4. For better performance, install with GPU support if it's available.
-
Install requirements:
pip install -r requirements.txt
Note: you need at least 40GB of free disk space to train a model.
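Before downloading a dataset, it can help to confirm that the disk-space requirement is met. The following is a minimal sketch, not part of this repository: the helper names `free_gb` and `has_enough_space` are hypothetical, and it assumes "40GB" means 40 x 10^9 bytes on the filesystem that will hold `~/tacotron`.

```python
import os
import shutil

REQUIRED_GB = 40  # figure taken from the note above


def free_gb(path):
    """Return the free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(os.path.expanduser(path)).free / 10 ** 9


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if at least `required_gb` GB are free on the filesystem holding `path`."""
    return free_gb(path) >= required_gb
```

For example, `has_enough_space("~")` checks the filesystem that contains your home directory, which is where `~/tacotron` will be created.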
-
Download a speech dataset.
The following are supported out of the box:
- LJ Speech (Public Domain)
- Blizzard 2012 (Creative Commons Attribution Share-Alike)
You can use other datasets if you convert them to the right format. See TRAINING_DATA.md for more info.
-
Unpack the dataset into ~/tacotron.
After unpacking, your tree should look like this for LJ Speech:
tacotron
  |- LJSpeech-1.0
      |- metadata.csv
      |- wavs
or like this for Blizzard 2012:
tacotron
  |- Blizzard2012
      |- ATrampAbroad
      |   |- sentence_index.txt
      |   |- lab
      |   |- wav
      |- TheManThatCorruptedHadleyburg
          |- sentence_index.txt
          |- lab
          |- wav
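A quick way to confirm the layout before preprocessing is to check that each expected file or directory exists. This is a hedged sketch rather than anything shipped with the repository: `EXPECTED` and `missing_entries` are hypothetical names, and the paths are copied from the trees above.

```python
import os

# Expected entries relative to the base directory (~/tacotron),
# copied from the directory trees shown above.
EXPECTED = {
    "ljspeech": [
        "LJSpeech-1.0/metadata.csv",
        "LJSpeech-1.0/wavs",
    ],
    "blizzard": [
        "Blizzard2012/ATrampAbroad/sentence_index.txt",
        "Blizzard2012/ATrampAbroad/lab",
        "Blizzard2012/ATrampAbroad/wav",
        "Blizzard2012/TheManThatCorruptedHadleyburg/sentence_index.txt",
        "Blizzard2012/TheManThatCorruptedHadleyburg/lab",
        "Blizzard2012/TheManThatCorruptedHadleyburg/wav",
    ],
}


def missing_entries(base_dir, dataset):
    """Return the expected entries for `dataset` that are absent under `base_dir`."""
    base_dir = os.path.expanduser(base_dir)
    return [
        rel
        for rel in EXPECTED[dataset]
        if not os.path.exists(os.path.join(base_dir, *rel.split("/")))
    ]
```

Calling `missing_entries("~/tacotron", "ljspeech")` returns an empty list when the LJ Speech tree is in place, and otherwise lists what is missing.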
-
Preprocess the data
python3 preprocess.py --dataset ljspeech
- Use --dataset blizzard for Blizzard data
- Use
-
Train a model
python3 train.py
Tunable hyperparameters are found in hparams.py. You can adjust these at the command line using the --hparams flag, for example --hparams="batch_size=16,outputs_per_step=2". Hyperparameters should generally be set to the same values at both training and eval time.
-
Monitor with TensorBoard (optional)
tensorboard --logdir ~/tacotron/logs-tacotron
The trainer dumps audio and alignments every 1000 steps. You can find these in ~/tacotron/logs-tacotron.
-
Synthesize from a checkpoint
python3 demo_server.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
Replace "185000" with the checkpoint number that you want to use, then open a browser to localhost:9000 and type what you want to speak. Alternatively, you can run eval.py at the command line:
python3 eval.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
If you set the --hparams flag when training, set the same value here.
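The --hparams value is a comma-separated list of name=value pairs, so one way to confirm that training and synthesis used the same overrides is to parse both strings and compare them. The sketch below is a plain-Python illustration under that assumption. It is not the parser this project uses, and `parse_overrides` and `same_overrides` are hypothetical helper names.

```python
def _convert(raw):
    """Convert a raw string to int or float when possible, else keep it as a string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_overrides(spec):
    """Parse a string such as 'batch_size=16,outputs_per_step=2' into a dict."""
    overrides = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments, e.g. a trailing comma
        name, sep, raw = item.partition("=")
        if not sep:
            raise ValueError("expected name=value, got %r" % item)
        overrides[name.strip()] = _convert(raw.strip())
    return overrides


def same_overrides(train_spec, eval_spec):
    """Return True if both override strings set the same names to the same values."""
    return parse_overrides(train_spec) == parse_overrides(eval_spec)
```

Because the comparison is done on dictionaries, the order of the pairs and the spacing around commas do not matter.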
- Keithito's implementation of tacotron: https://github.com/keithito/tacotron
- Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous. 2018. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
- By syang1993: https://github.com/syang1993/gst-tacotron