/Tacotron

Tacotron + WaveRNN implementation

Primary LanguageJupyter NotebookMIT LicenseMIT

Tacotron + WaveRNN

A modified version of Fatchord/WaveRNN repository , which implements a variant of the system described in Tacotron: Towards End-to-End Speech Synthesis.

Additions to original repo:

Tacotron with WaveRNN diagrams

Installation

Ensure you have:

Install by creating a virtual environment using the following commands and install the the rest with pip:

conda create -n taco python=3.6 anaconda

conda install pytorch=1.0.1 torchvision cudatoolkit=8.0 -c pytorch

pip install -r requirements.txt

How to Use

Quick Start

To test that everything is installed and working correctly run:

python quick_start.py --hp_file hp_JE.py

This will generate everything in the default phonemes.txt file and synthesize wavs which are saved in the 'quick_start' folder

Training your own Models

Download the Blizzard Challenge 2013 wavs and Blizzard Challenge 2013 txt.

In the same wav directory, make sure you have the corresponding transcription/metadata file which looks like this:

audio || text | phoneme sequence

CB-JE-01-59||but Eliza just put her head in at the door, and said at once "She is in the window-seat, to be sure, Jack."|<START> b ah t <> ih l ay z ax <> jh ah s t <> p uh t <> hh er <> hh eh d <> ih n <> ae t <> dh ax <> d ao r <,> ae n d <> s eh d <> ae t <> w ah n s <> sh iy <> ih z <> ih n <> dh ax <> w ih n d ow <> s iy t <,> t ax <> b iy <> sh uh r <,> jh ae k <.> <END>

1 - Edit hparams.py or create new hparams eg., hp_JE.py

Edit the following parameters accordingly in hparams.py:

      wav_path = 'data/wavs_train/'

      data_path = 'data/CB_JE'

      book_names = ['CB-JE']

      voc_model_id = 'blizzard_vocoder'

      tts_model_id = 'blizzard_baseline_JE'

      metadata = "train.csv"

2 - Next run the following to perform feature extraction:

python preprocess.py --hp_file hp_JE.py

This extracts mels from wavs and dumps your linguistic features in a pickle file in the location specified by data_path in hparams.py.

To train the model:

3 - Train Tacotron using:

python train_tacotron.py --hp_file hp_JE.py

4 - To generate from the model:

Edit the autogen.sh script:

python test_taco_model_auto.py --checkpoint blizzard_baseline_JE.tacotron/taco_step300K_weights.pyt --model_name JE

This will generate the wavs using the test.txt phoneme sequences and save them in model_output.

Recipes

Recipe to train model with pre-aligned guides

Pretrained Models

Currently there are 3 pretrained models available in the /pretrained/ folder':

Both are trained on Blizzard 2013:

  • Tacotron trained using 1 book (JE) to 300k steps
  • Tacotron trained using 4 books (that have wavs that have been carefully selected to control variability in speech rate) to 300k steps
  • WaveRNN (Mixture of Logistics output) using LJSpeech trained to 800k steps (taken from original Fatchord/WaveRNN repo)

References

Acknowlegements