MelGAN-Waveform-synthesis

PyTorch re-implementation of MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis.


What is Text To Speech (TTS)?

  • Voice synthesis: converting text input into an audio waveform

  • Audio modeling

    Stage 1. Models the intermediate representation given text as input

    Stage 2. Transforms the intermediate representation back to audio (i.e., vocoder)

  • Representation

    • Typically chosen to be easier to model than raw audio while preserving enough information to allow faithful inversion back to audio

[fig2: two-stage TTS pipeline]

Basic Knowledge

  • Mel-spectrogram (example data from LJ Speech dataset [LJ001-0001])

  • LJ Speech dataset:

    This dataset consists of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books.

    Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

[fig1: mel-spectrogram of LJ Speech clip LJ001-0001]

MelGAN

  • Generator

    • The generator is a fully convolutional feed-forward network that takes a mel-spectrogram as input and produces a raw waveform as output.
    • A stack of transposed convolutional layers is used to upsample the input sequence, and each transposed convolutional layer is followed by a stack of residual blocks with dilated convolutions.
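The generator described above can be sketched in PyTorch as follows. The 8×8×2×2 upsampling schedule (256× total, matching a hop length of 256), channel widths, and residual dilations (1, 3, 9) follow the MelGAN paper's defaults, but they are assumptions here and may differ from this repo's exact configuration; weight normalization is omitted for brevity.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Residual block with a dilated convolution, as used after each upsampling layer."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            # dilated conv; padding = dilation keeps the sequence length unchanged
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=dilation, padding=dilation),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)


class Generator(nn.Module):
    """Fully convolutional mel-to-waveform generator (minimal sketch)."""

    def __init__(self, n_mels: int = 80, channels: int = 512):
        super().__init__()
        layers = [nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)]
        for r in (8, 8, 2, 2):  # 8 * 8 * 2 * 2 = 256x upsampling
            layers += [
                nn.LeakyReLU(0.2),
                # kernel = 2r, stride = r, padding = r/2 gives exact r-fold upsampling
                nn.ConvTranspose1d(channels, channels // 2,
                                   kernel_size=2 * r, stride=r, padding=r // 2),
            ]
            channels //= 2
            layers += [ResidualBlock(channels, d) for d in (1, 3, 9)]
        layers += [
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform output in [-1, 1]
        ]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        # mel: (batch, n_mels, frames) -> (batch, 1, frames * 256)
        return self.net(mel)


mel = torch.randn(1, 80, 32)   # (batch, n_mels, frames)
audio = Generator()(mel)       # (1, 1, 32 * 256) = (1, 1, 8192)
print(audio.shape)
```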
  • Discriminator

    • A multi-scale architecture with three discriminators is adopted; they have identical network structures but operate on different audio scales (raw audio and versions downsampled by 2x and 4x via average pooling).
    • This structure carries an inductive bias that each discriminator learns features for a different frequency range of the audio.
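The multi-scale discriminator can be sketched as below. Each single-scale discriminator is a stack of strided (grouped) 1-D convolutions over raw audio; the kernel sizes, strides, and channel counts follow the paper's window-based discriminator but are assumptions about this repo. Intermediate feature maps are returned as well, since MelGAN's feature-matching loss compares them between real and generated audio.

```python
import torch
import torch.nn as nn


class DiscriminatorBlock(nn.Module):
    """Single-scale discriminator: strided 1-D convs producing a score map
    plus intermediate features (for the feature-matching loss)."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, 16, 15, padding=7), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv1d(16, 64, 41, stride=4, groups=4, padding=20),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv1d(64, 256, 41, stride=4, groups=16, padding=20),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv1d(256, 256, 5, padding=2), nn.LeakyReLU(0.2)),
            nn.Conv1d(256, 1, 3, padding=1),  # per-window real/fake score map
        ])

    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats  # last entry is the score map


class MultiScaleDiscriminator(nn.Module):
    """Three identical discriminators on raw, 2x-, and 4x-downsampled audio."""

    def __init__(self):
        super().__init__()
        self.discriminators = nn.ModuleList(DiscriminatorBlock() for _ in range(3))
        # strided average pooling halves the audio between scales
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, audio):
        outputs = []
        for disc in self.discriminators:
            outputs.append(disc(audio))
            audio = self.pool(audio)
        return outputs


audio = torch.randn(2, 1, 4096)              # (batch, 1, samples)
outputs = MultiScaleDiscriminator()(audio)   # one feature list per scale
print([feats[-1].shape for feats in outputs])
```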

[gd: generator and discriminator architectures]