An implementation of the GlowTTS model. Several modes are added: speaker embedding, prosody encoder (GST), and gradient reversal.

Multispeaker GlowTTS

Requirements

  • torch >= 1.5.1

  • tensorboardX >= 2.0

  • librosa >= 0.7.2

  • matplotlib >= 3.1.3

  • Optional for loss flow

    • tensorboard >= 2.2.2

Structure

Vanilla mode (Single speaker GlowTTS)

Training

Inference

Speaker embedding mode

Training

Inference

Prosody encoding mode (GST GlowTTS)

Training

Inference

Gradient reversal mode (Voice cloning GlowTTS - Failed)

Training

Inference

Used dataset

  • Currently uploaded code is compatible with the following datasets.
  • An O to the left of a dataset name means that the dataset was actually used for the uploaded results (Single: single-speaker mode, Multi: multi-speaker modes).
| Single | Multi | Dataset | Dataset address |
|--------|-------|---------|-----------------|
| O | O | LJSpeech | https://keithito.com/LJ-Speech-Dataset/ |
| X | X | BC2013 | http://www.cstr.ed.ac.uk/projects/blizzard/ |
| X | O | CMU Arctic | http://www.festvox.org/cmu_arctic/index.html |
| X | O | VCTK | https://datashare.is.ed.ac.uk/handle/10283/2651 |
| X | X | LibriTTS | https://openslr.org/60/ |

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in 'Hyper_Parameters.yaml' according to your environment.

  • Sound

    • Setting basic sound parameters.
    • Some parameters, such as pitch, are not used in the current code. They are reserved for future work.
  • Use_Cython_Alignment

    • Setting which implementation of monotonic alignment search to use.
    • If true, the Cython implementation from the official code will be used.
    • If false, the Python implementation will be used.
    • The Cython implementation is recommended because it is faster.
  • Encoder

    • Setting the encoder parameters
  • Decoder

    • Setting the glow decoder parameters.
  • WaveNet

    • Setting the parameters of Vocoder.
    • This implementation uses a pre-trained Parallel WaveGAN model.
    • If the checkpoint path is null, the model does not export wav files.
    • If the checkpoint path is not null, all parameters must match those of the pre-trained Parallel WaveGAN model.
  • Speaker_Embedding

    • Setting the method used to generate speaker embeddings.
    • For Type, you can select null, 'LUT', or 'GE2E'.
      • null: No speaker embedding. This is the single speaker version.
      • LUT: The model will generate a lookup table for the speakers.
      • GE2E: The model will use d-vectors generated by a pretrained GE2E model.
  • Token path

    • Setting the token-to-index dict.
    • The pattern generator creates this file.
  • Train

    • Setting the parameters of training.
  • Inference_Batch_Size

    • Setting the batch size used during inference.
    • If null, it will be the same as Train/Batch_Size.
  • Inference_Path

    • Setting the inference path
  • Checkpoint_Path

    • Setting the checkpoint path
  • Log_Path

    • Setting the tensorboard log path
  • Use_Mixed_Precision

    • Setting mixed precision.
    • To use it, Nvidia apex must be installed in the environment.
    • With some preprocessing hyperparameter settings, a loss overflow problem occurs.
  • Device

    • Setting which GPU device is used in a multi-GPU environment.
    • If using only the CPU, please set '-1'.
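The two fallback rules above (Inference_Batch_Size and Device) can be illustrated with a minimal sketch. It assumes the settings have already been loaded into a nested dict that mirrors the names in this README. The helper name `resolve_runtime_settings` and the mapping of a GPU index to a `cuda:N` string are assumptions made for illustration, not the repository's actual loading code.

```python
def resolve_runtime_settings(hp):
    """Resolve the documented fallbacks from a dict mirroring Hyper_Parameters.yaml.

    Hypothetical helper: the key layout follows the names in this README,
    but the repository's own loading code may differ.
    """
    # Inference_Batch_Size: null (None) falls back to Train/Batch_Size.
    batch_size = hp.get('Inference_Batch_Size')
    if batch_size is None:
        batch_size = hp['Train']['Batch_Size']

    # Device: '-1' selects the CPU; any other value is treated as a GPU index.
    # The 'cuda:N' form is an assumption about how the index would be used.
    device_id = str(hp.get('Device', '-1')).strip()
    device = 'cpu' if device_id == '-1' else 'cuda:{}'.format(device_id)

    return {'inference_batch_size': batch_size, 'device': device}


# Example values only; they are not the repository defaults.
hp = {'Train': {'Batch_Size': 32}, 'Inference_Batch_Size': None, 'Device': '-1'}
settings = resolve_runtime_settings(hp)
print(settings)
```

With the example values, the inference batch size is taken from Train/Batch_Size and the CPU is selected.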

Generate pattern

Command

python Pattern_Generate.py [parameters]

Parameters

At least one dataset must be specified.

  • -lj
    • Set the path of LJSpeech. LJSpeech's patterns are generated.
  • -bc2013
    • Set the path of Blizzard Challenge 2013. Blizzard Challenge 2013's patterns are generated.
  • -cmua
    • Set the path of CMU arctic. CMU arctic's patterns are generated.
  • -vctk
    • Set the path of VCTK. VCTK's patterns are generated.
  • -libri
    • Set the path of LibriTTS. LibriTTS's patterns are generated.
  • -vc1
    • Set the path of VoxCeleb1. Glow-TTS does not support this because VoxCeleb datasets do not have text data.
  • -vc2
    • Set the path of VoxCeleb2. Glow-TTS does not support this because VoxCeleb datasets do not have text data.
  • -vc1t
    • Set the path of the VoxCeleb1 test set. Glow-TTS does not support this because VoxCeleb datasets do not have text data.
  • -text
    • Set whether the text information is saved.
    • The option exists for other models. To use the patterns in Glow TTS, this option must be set.
  • -evalr
    • Set the evaluation pattern ratio.
    • Default is 0.001.
  • -evalm
    • Set the minimum number of evaluation patterns for each speaker.
    • Default is 1.
  • -mw
    • Set the number of threads used to create the patterns.
    • Default is 10.
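To make the parameter list above concrete, here is a rough sketch of how such a command line could be declared with `argparse`. It is not the actual contents of Pattern_Generate.py. The flag names and defaults come from the list above; treating `-text` as an on/off switch, the helper names, the example path, and the rule combining `-evalr` with `-evalm` (floor of the ratio, then at least the minimum) are assumptions.

```python
import argparse

# Dataset flags that Glow-TTS can use, as listed above.
DATASET_FLAGS = {
    '-lj': 'LJSpeech',
    '-bc2013': 'Blizzard Challenge 2013',
    '-cmua': 'CMU Arctic',
    '-vctk': 'VCTK',
    '-libri': 'LibriTTS',
}


def build_parser():
    parser = argparse.ArgumentParser(prog='Pattern_Generate.py')
    for flag, name in DATASET_FLAGS.items():
        parser.add_argument(flag, default=None, help='Path of {}'.format(name))
    # Assumption: -text is a switch without a value.
    parser.add_argument('-text', action='store_true', help='Save text information')
    parser.add_argument('-evalr', type=float, default=0.001, help='Evaluation pattern ratio')
    parser.add_argument('-evalm', type=int, default=1, help='Minimum evaluation patterns per speaker')
    parser.add_argument('-mw', type=int, default=10, help='Number of worker threads')
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # At least one dataset path must be given.
    if all(getattr(args, flag.lstrip('-')) is None for flag in DATASET_FLAGS):
        parser.error('at least one dataset path is required')
    return args


def eval_pattern_count(speaker_pattern_count, ratio, minimum):
    # Assumed rule: take the ratio of a speaker's patterns, but never fewer than the minimum.
    return max(minimum, int(speaker_pattern_count * ratio))


# '/data/LJSpeech-1.1' is a placeholder path.
args = parse_args(['-lj', '/data/LJSpeech-1.1', '-text'])
print(args.lj, args.text, eval_pattern_count(500, args.evalr, args.evalm))
```

Under these assumptions, a speaker with 500 patterns and the default settings would still contribute one evaluation pattern, because the minimum takes over when the ratio rounds down to zero.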

Run

Command

python Train.py -s <int>
  • -s <int>
    • The resume step parameter.
    • Default is 0.
    • When this parameter is 0, the model tries to find the latest checkpoint in the checkpoint path.
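The resume behavior described above can be sketched as follows. This is only an illustration of the rule "0 means use the latest checkpoint". The checkpoint file naming (`S_<step>.pt`) and both helper names are hypothetical, and the real Train.py may locate checkpoints differently.

```python
import re

# Hypothetical naming scheme: one file per saved step, e.g. 'S_25000.pt'.
CHECKPOINT_PATTERN = re.compile(r'^S_(\d+)\.pt$')


def latest_checkpoint_step(file_names):
    """Return the largest step found among checkpoint file names, or 0 if none match."""
    steps = []
    for name in file_names:
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps, default=0)


def resolve_resume_step(requested_step, file_names):
    """A non-zero -s value is used as given; 0 falls back to the latest checkpoint."""
    if requested_step > 0:
        return requested_step
    return latest_checkpoint_step(file_names)


# In practice the names would come from os.listdir(<Checkpoint_Path>).
files = ['S_1000.pt', 'S_25000.pt', 'notes.txt']
print(resolve_resume_step(0, files))
```

If no checkpoint exists, the sketch returns 0, which corresponds to starting training from scratch.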

Inference

Result

Please see the demo site.

Trained checkpoint

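| Mode | Dataset | Trained steps | Link |
|------|---------|---------------|------|
| Vanilla | LJ | 100000 | Link (Broken) |
| SE & LUT | LJ + CMUA | 100000 | Link |
| SE & LUT | LJ + VCTK | 100000 | Link |
| PE | LJ + CMUA | 100000 | Link |
| PE | LJ + VCTK | 400000 | Link |
| GR & LUT | LJ + VCTK | 400000 | Link (Failed) |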

Future works

  • Training with GE2E speaker embedding
  • Gradient reversal model structure improvement
  • Training additional steps