
Tacotron2 PyTorch

A PyTorch implementation of Tacotron 2, described in Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions (https://arxiv.org/pdf/1712.05884.pdf), an end-to-end text-to-speech (TTS) neural network architecture that converts a character sequence directly to speech.

  • Uses log mel spectrograms and the WaveGlow vocoder to synthesize audio
  • Changes tensor dimensions from (N, T, C) to (N, C, T); see the sketch after this list
    • N : batch size, C : channels, T : time steps
  • Adds stop-token handling at inference time
  • Uses a thinner pre-net to get more accurate attention
  • Uses a slightly different text encoder
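
The following is a minimal, illustrative sketch (not code from this repository) of the (N, C, T) layout mentioned above; tensor names and sizes are made up for the example:

import torch

N, T, C = 4, 800, 80               # batch size, time steps, mel channels (illustrative values)
mel_ntc = torch.randn(N, T, C)     # (N, T, C) layout
mel_nct = mel_ntc.transpose(1, 2)  # (N, C, T) layout used by this implementation
print(mel_nct.shape)               # torch.Size([4, 80, 800])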

Environment

  • Ubuntu 16.04
  • Python 3.6
  • PyTorch 1.2.0
  • 2 GPUs

Install

  • Install the external repo (pytorch_sound)

See the README.md of pytorch_sound first to prepare the dataset.

$ pip install git+https://github.com/Appleholic/pytorch_sound
  • Install this package
$ pip install -e .

Usage

  • Train
$ python tacotron2_pytorch/train.py [YOUR_META_DIR] [SAVE_DIR] [SAVE_PREFIX] [[OTHER OPTIONS...]]
  • Synthesize (one sample)
    • It writes the audio, a waveform plot, and attention and mel spectrogram images.
$ python tacotron2_pytorch/synthesize.py [TEXT] [PRETRAINED_PATH] [MODEL_NAME] [SAVE_DIRECTORY]

Known Issues

  • At inference time, the spectrogram shows several stripes. This might be caused by the hard dropout (it does not appear at training time).
  • The stop token does not work well at inference time.
  • Error case and how to resolve it: Torch Hub WaveGlow (see the sketch below)
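
As a reference for the Torch Hub WaveGlow item above, here is a hedged sketch of the standard PyTorch Hub loading pattern for NVIDIA's WaveGlow (entry-point names as published on PyTorch Hub; the mel tensor is a placeholder, and the actual error case and its fix are not reproduced here):

import torch

# Load the published WaveGlow checkpoint from PyTorch Hub (NVIDIA DeepLearningExamples).
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda').eval()

# mel: (N, 80, T) mel spectrogram produced by the Tacotron2 model (random placeholder here).
mel = torch.randn(1, 80, 400, device='cuda')
with torch.no_grad():
    audio = waveglow.infer(mel)  # (N, samples) waveform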

Results

  • Total Validation Loss
    • Sum of two MSE losses (the linear decoder output and the output after the post-net) and the stop-token BCE loss; a sketch follows below
    • red : pre-net 64 dim, blue : pre-net 256 dim
    • 100,000 steps

Validation Loss curve
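
A minimal sketch of how such a total loss can be computed (hypothetical tensor and function names, not code from this repository):

import torch
import torch.nn.functional as F

def total_loss(mel_linear, mel_post, stop_logits, mel_target, stop_target):
    # mel_linear / mel_post: (N, C, T) decoder output before / after the post-net
    # stop_logits: (N, T) raw stop-token predictions; stop_target: (N, T) in {0, 1}
    mse_linear = F.mse_loss(mel_linear, mel_target)
    mse_post = F.mse_loss(mel_post, mel_target)
    stop_bce = F.binary_cross_entropy_with_logits(stop_logits, stop_target)
    return mse_linear + mse_post + stop_bce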

  • Attention, Mel Spectrogram Sample

test sample