Tacotron

An implementation of Tacotron speech synthesis in TensorFlow.

WIP

Mycroft-core TTS engine

Background

Earlier this year, Google published a paper, Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model, where they present a neural text-to-speech model that learns to synthesize speech directly from (text, audio) pairs. However, they didn't release their source code or training data. This is an attempt to provide an open-source implementation of the model described in their paper.

The quality isn't as good as Google's demo yet, but hopefully it will get there someday :-). Pull requests are welcome!

Quick Start

Installing dependencies

./requirements.sh
pip install -r requirements.txt

Using a pre-trained model

Run test.py from the command line.

Training

Note: you need at least 40GB of free disk space to train a model.
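
Before downloading a dataset, it can help to confirm that the disk has enough room. The following is a minimal sketch, not part of this repository: the helper names `free_gb` and `has_enough_space` are hypothetical, and the 40 GB threshold is simply the figure from the note above.

```python
import shutil

REQUIRED_GB = 40  # figure taken from the note above


def free_gb(path):
    """Return free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb
```

Calling `has_enough_space(os.path.expanduser("~"))` (after `import os`) checks the filesystem that would hold ~/tacotron.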

Download a speech dataset.

The following are supported out of the box:
- LJ Speech (Public Domain)
- Blizzard 2012 (Creative Commons Attribution Share-Alike)
You can use other datasets if you convert them to the right format. See TRAINING_DATA.md for more info.

Unpack the dataset into ~/tacotron

After unpacking, your tree should look like this for LJ Speech:

tacotron
  |- LJSpeech-1.0
      |- metadata.csv
      |- wavs

or like this for Blizzard 2012:

tacotron
  |- Blizzard2012
      |- ATrampAbroad
      |   |- sentence_index.txt
      |   |- lab
      |   |- wav
      |- TheManThatCorruptedHadleyburg
          |- sentence_index.txt
          |- lab
          |- wav
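
A quick way to confirm the layout before preprocessing is to check for the entries shown in the trees above. This is a hedged sketch rather than anything shipped with the repository: the `EXPECTED` table and the `missing_entries` helper are hypothetical names, and the paths are copied directly from the two trees.

```python
import os

# Expected entries, relative to the base directory, copied from the trees above.
EXPECTED = {
    "ljspeech": [
        "LJSpeech-1.0/metadata.csv",
        "LJSpeech-1.0/wavs",
    ],
    "blizzard": [
        "Blizzard2012/ATrampAbroad/sentence_index.txt",
        "Blizzard2012/ATrampAbroad/lab",
        "Blizzard2012/ATrampAbroad/wav",
        "Blizzard2012/TheManThatCorruptedHadleyburg/sentence_index.txt",
        "Blizzard2012/TheManThatCorruptedHadleyburg/lab",
        "Blizzard2012/TheManThatCorruptedHadleyburg/wav",
    ],
}


def missing_entries(base_dir, dataset):
    """Return the expected relative paths that do not exist under base_dir."""
    base_dir = os.path.expanduser(base_dir)
    return [
        rel
        for rel in EXPECTED[dataset]
        if not os.path.exists(os.path.join(base_dir, *rel.split("/")))
    ]
```

For example, `missing_entries("~/tacotron", "ljspeech")` returns an empty list when the LJ Speech tree is in place.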

Preprocess the data
```
python preprocess.py --dataset ljspeech
```
- Use --dataset blizzard for Blizzard data
Train a model
```
python train.py
```
Monitor with Tensorboard (optional)
```
tensorboard --logdir ~/tacotron/logs-tacotron
```
The trainer dumps audio and alignments every 1000 steps. You can find these in ~/tacotron/logs-tacotron.
Synthesize from a checkpoint

Replace "185000" with the checkpoint number that you want to use, then run eval.py at the command line:
```
python eval.py --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
```
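
To see which checkpoint numbers are available, you can list the step numbers in the log directory. The sketch below assumes that each checkpoint is accompanied by a `model.ckpt-<step>.index` file, as TensorFlow's V2 checkpoint format normally writes; `checkpoint_steps` is a hypothetical helper, not part of this repository.

```python
import glob
import os
import re


def checkpoint_steps(log_dir):
    """Return sorted step numbers that have a model.ckpt-<step>.index file.

    Assumes TensorFlow's V2 checkpoint naming; adjust the pattern if your
    checkpoints are stored differently.
    """
    pattern = os.path.join(os.path.expanduser(log_dir), "model.ckpt-*.index")
    steps = []
    for path in glob.glob(pattern):
        match = re.search(r"model\.ckpt-(\d+)\.index$", path)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)
```

The last element of `checkpoint_steps("~/tacotron/logs-tacotron")`, if any, is the most recent step and can be substituted for 185000 in the command above.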

Miscellaneous Notes

- TCMalloc seems to improve training speed and avoids occasional slowdowns seen with the default allocator. You can enable it by installing it and setting LD_PRELOAD=/usr/lib/libtcmalloc.so.
- You can train with CMUDict by downloading the dictionary to ~/tacotron/training and then passing the flag --hparams="use_cmudict=True" to train.py. This will allow you to pass ARPAbet phonemes enclosed in curly braces at eval time to force a particular pronunciation, e.g. Turn left on {HH AW1 S S T AH0 N} Street.
- If you pass a Slack incoming webhook URL as the --slack_url flag to train.py, it will send you progress updates every 1000 steps.
- Occasionally, you may see a spike in loss and the model will forget how to attend (the alignments will no longer make sense). Although it will recover eventually, it may save time to restart at a checkpoint prior to the spike by passing the --restore_step=150000 flag to train.py (replacing 150000 with a step number prior to the spike). Update: a recent fix to gradient clipping by @candlewill may have fixed this.
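
To illustrate the curly-brace ARPAbet notation mentioned in the CMUDict note, here is a small sketch that splits an input sentence into plain-text and phoneme segments. It is an illustration only and is not the text-processing code used by this repository; `split_phonemes` is a hypothetical name.

```python
import re

# Matches one curly-brace group such as {HH AW1 S S T AH0 N}.
_CURLY_RE = re.compile(r"\{([^{}]+)\}")


def split_phonemes(text):
    """Split text into ("text", str) and ("arpabet", [phonemes]) segments."""
    segments = []
    pos = 0
    for match in _CURLY_RE.finditer(text):
        if match.start() > pos:
            # Plain text preceding the curly-brace group.
            segments.append(("text", text[pos:match.start()]))
        # Phonemes inside the braces are separated by whitespace.
        segments.append(("arpabet", match.group(1).split()))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments
```

For the example sentence above, `split_phonemes("Turn left on {HH AW1 S S T AH0 N} Street.")` yields a text segment, a seven-phoneme ARPAbet segment, and a trailing text segment.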

Other Implementations

By Alex Barron: https://github.com/barronalex/Tacotron
By Kyubyong Park: https://github.com/Kyubyong/tacotron
