MelGAN-Waveform-synthesis

PyTorch re-implementation of MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis.


What is Text To Speech (TTS)?

  • Voice synthesis: converting text input into an audio waveform

  • Audio modeling

    Stage 1. Models the intermediate representation given text as input

    Stage 2. Transforms the intermediate representation back to audio (i.e., vocoder)

  • Representation

    • Typically chosen to be easier to model than raw audio while preserving enough information to allow faithful inversion back to audio

[fig2: two-stage TTS pipeline]

Basic Knowledge

  • Mel-spectrogram (example data from LJ Speech dataset [LJ001-0001])

  • LJ Speech dataset:

    This dataset consists of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books.

    Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

[fig1: mel-spectrogram of LJ Speech clip LJ001-0001]

MelGAN

  • Generator

    • The generator is a fully convolutional feed-forward network that takes a mel-spectrogram as input and produces a raw waveform as output.
    • A stack of transposed convolutional layers is used to upsample the input sequence, and each transposed convolutional layer is followed by a stack of residual blocks with dilated convolutions.
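The generator described above can be sketched in PyTorch as follows. The 8×8×2×2 upsampling schedule (256× total, matching a hop length of 256), channel widths, and residual dilations (1, 3, 9) follow the MelGAN paper's defaults, but they are assumptions here and may differ from this repo's exact configuration; weight normalization is omitted for brevity.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Residual block with a dilated convolution, as used after each upsampling layer."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            # dilated conv; padding = dilation keeps the sequence length unchanged
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=dilation, padding=dilation),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)


class Generator(nn.Module):
    """Fully convolutional mel-to-waveform generator (minimal sketch)."""

    def __init__(self, n_mels: int = 80, channels: int = 512):
        super().__init__()
        layers = [nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)]
        for r in (8, 8, 2, 2):  # 8 * 8 * 2 * 2 = 256x upsampling
            layers += [
                nn.LeakyReLU(0.2),
                # kernel = 2r, stride = r, padding = r/2 gives exact r-fold upsampling
                nn.ConvTranspose1d(channels, channels // 2,
                                   kernel_size=2 * r, stride=r, padding=r // 2),
            ]
            channels //= 2
            layers += [ResidualBlock(channels, d) for d in (1, 3, 9)]
        layers += [
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform output in [-1, 1]
        ]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        # mel: (batch, n_mels, frames) -> (batch, 1, frames * 256)
        return self.net(mel)


mel = torch.randn(1, 80, 32)   # (batch, n_mels, frames)
audio = Generator()(mel)       # (1, 1, 32 * 256) = (1, 1, 8192)
print(audio.shape)
```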
  • Discriminator

    • A multi-scale architecture with three discriminators is adopted; they have identical network structures but operate on different audio scales (raw audio and versions downsampled by 2x and 4x via average pooling).
    • This structure carries an inductive bias that each discriminator learns features for a different frequency range of the audio.
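The multi-scale discriminator can be sketched as below. Each single-scale discriminator is a stack of strided (grouped) 1-D convolutions over raw audio; the kernel sizes, strides, and channel counts follow the paper's window-based discriminator but are assumptions about this repo. Intermediate feature maps are returned as well, since MelGAN's feature-matching loss compares them between real and generated audio.

```python
import torch
import torch.nn as nn


class DiscriminatorBlock(nn.Module):
    """Single-scale discriminator: strided 1-D convs producing a score map
    plus intermediate features (for the feature-matching loss)."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, 16, 15, padding=7), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv1d(16, 64, 41, stride=4, groups=4, padding=20),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv1d(64, 256, 41, stride=4, groups=16, padding=20),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv1d(256, 256, 5, padding=2), nn.LeakyReLU(0.2)),
            nn.Conv1d(256, 1, 3, padding=1),  # per-window real/fake score map
        ])

    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats  # last entry is the score map


class MultiScaleDiscriminator(nn.Module):
    """Three identical discriminators on raw, 2x-, and 4x-downsampled audio."""

    def __init__(self):
        super().__init__()
        self.discriminators = nn.ModuleList(DiscriminatorBlock() for _ in range(3))
        # strided average pooling halves the audio between scales
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, audio):
        outputs = []
        for disc in self.discriminators:
            outputs.append(disc(audio))
            audio = self.pool(audio)
        return outputs


audio = torch.randn(2, 1, 4096)              # (batch, 1, samples)
outputs = MultiScaleDiscriminator()(audio)   # one feature list per scale
print([feats[-1].shape for feats in outputs])
```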

[gd: generator and discriminator architectures]