Transformer Text To Speech

A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. Now with recent development in deep learning, it's possible to convert text into a human-understandable voice. For this, the text is fed into an Encoder-Decoder type Neural Network to output a Mel-Spectrogram. This Mel-Spectrogram can now be used to generate audio using the "Griffin-Lim Algorithm". But due to its disadvantage that it is not able to produce human-like speech quality, another neural net named WaveNet is employed, which is fed by Mel-Spectrogram to produce audio that even a human is not able to differentiate apart.

Model Architecture

1. Transformer TTS

An Encoder-Decoder transformer architecture for parallel training instead for Seq2Seq training incase of Tacotron-2.
Text are sent as input and the model outputs a Mel-Spectrogram.
Multi-headed attention is employed, with causal masking only on the decoder side.
Paper : Neural Speech Synthesis with Transformer Network.

2. Wavenet

Output of the Transformer tts (Mel-Spectrogram) is fed into the Wavenet to generate audio samples.
Unlike Seq2Seq models wavenet also allows parallel training.
Paper : WaveNet: A Generative Model for Raw Audio.

Dataset Information

The model was trained on a subset of WMT-2014 English-German Dataset. Preprocessing was carried out before training the model.
Dataset : https://keithito.com/LJ-Speech-Dataset/

ShivamRajSharma/Transformer-Text-To-Speech

Transformer Text To Speech

Model Architecture

1. Transformer TTS

2. Wavenet

Dataset Information