A compilation of Text-to-Speech Synthesis projects
-

- NVIDIA's Tacotron 2
  [Paper] https://arxiv.org/pdf/1712.05884.pdf
  [Code] https://github.com/NVIDIA/tacotron2
- NVIDIA's OpenSeq2Seq
  [Documentation] https://nvidia.github.io/OpenSeq2Seq/
  [Code] https://github.com/NVIDIA/OpenSeq2Seq
- Deep Convolutional TTS
  [Paper] https://arxiv.org/pdf/1710.08969.pdf
  [Code] https://github.com/Kyubyong/dc_tts
  *Implemented by a third party, not by the paper's authors
- Google's Tacotron
  [Paper] https://arxiv.org/pdf/1703.10135.pdf
  [Code] https://github.com/keithito/tacotron
  [Code] https://github.com/MycroftAI/mimic2
  *TensorFlow implementations of Tacotron, not by the paper's authors
- Mozilla Text-to-Speech
  [Code] https://github.com/mozilla/TTS
- Stanford's GloVe
  [Documentation] https://nlp.stanford.edu/projects/glove/
  [Code] https://github.com/stanfordnlp/GloVe
- DeepMind's GAN-TTS
  [Paper] https://arxiv.org/pdf/1909.11646.pdf
  [Code] https://github.com/yanggeng1995/GAN-TTS
- Multi-Speaker Tacotron in TensorFlow
  [Code] https://github.com/carpedm20/multi-speaker-tacotron-tensorflow
- DeepVoice Series
  [DeepVoice 2] https://github.com/jdbermeol/deep_voice_2
  [DeepVoice 3] https://github.com/r9y9/deepvoice3_pytorch

** Most of the multi-speaker TTS repositories above are unofficial implementations
This project uses a combination of the existing works above, applied to the Tagalog language. NVIDIA's tacotron2 and waveglow provided the best results, despite those networks being optimized for single-speaker data while our Tagalog dataset is multi-speaker. A likely explanation is that, because tacotron2 trains at the per-character level, it learns speaker-independent features such as prosody well; the network captured this information but failed to model a consistent voice.
Training followed the same procedures as NVIDIA's repositories and Ryuichi Yamamoto's deepvoice3_pytorch. The data was edited and organised to match the expected inputs of each network, and the config files were changed to match the Tagalog dataset.
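As an illustration, the snippet below builds LJSpeech-style filelists (one "wav_path|transcript" line per utterance) in the form tacotron2 expects. It is a minimal sketch: the corpus layout it assumes (a wavs/ folder plus a tab-separated transcripts.txt) and the output filenames are placeholders, not the actual structure of our Tagalog dataset.

```python
"""Build LJSpeech-style filelists for Tacotron 2 from a Tagalog corpus.

Minimal sketch: the corpus layout assumed here (wavs/ plus a tab-separated
transcripts.txt of "<utt_id><TAB><text>" lines) and the output paths are
placeholders, not the actual structure of our dataset.
"""
import random
from pathlib import Path

CORPUS = Path("tagalog_corpus")           # hypothetical corpus root
WAV_DIR = CORPUS / "wavs"
TRANSCRIPTS = CORPUS / "transcripts.txt"  # hypothetical "<utt_id><TAB><text>" file

entries = []
for line in TRANSCRIPTS.read_text(encoding="utf-8").splitlines():
    utt_id, text = line.split("\t", 1)
    wav_path = WAV_DIR / f"{utt_id}.wav"
    if wav_path.exists():
        # Tacotron 2 filelists use one "wav_path|transcript" pair per line.
        entries.append(f"{wav_path}|{text.strip()}")

random.seed(0)
random.shuffle(entries)
n_val = max(1, len(entries) // 20)        # hold out roughly 5% for validation

out_dir = Path("filelists")
out_dir.mkdir(exist_ok=True)
(out_dir / "tagalog_train_filelist.txt").write_text("\n".join(entries[n_val:]), encoding="utf-8")
(out_dir / "tagalog_val_filelist.txt").write_text("\n".join(entries[:n_val]), encoding="utf-8")
```

For deepvoice3_pytorch, which consumes data through its preprocess.py front end, a similar script can instead emit an LJSpeech-style metadata.csv so the repo's existing LJSpeech preprocessor can be reused.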
Training tacotron2: python train.py --output_directory \[output dir] --log_directory \[log dir] -c \[optional, checkpoint file]
Training waveglow (in waveglow folder): python train.py -c config.json
Training deepvoice3 (in deepvoice3 folder): python train.py --data-root=\[data file] --preset=\[preset file] --checkpoint=\[optional, checkpoint file]
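Before running the waveglow command above, its config.json has to point at the Tagalog audio. The sketch below patches it programmatically; it assumes the stock config.json layout from the waveglow repo, and the filelist path, sampling rate, and output directory are project-specific placeholders. For tacotron2, the analogous edits are training_files, validation_files, and text_cleaners in hparams.py (or train.py's --hparams override).

```python
"""Point WaveGlow's config.json at the Tagalog data.

Minimal sketch assuming the stock config.json layout of the waveglow repo;
the filelist path, sampling rate, and output directory below are
project-specific placeholders.
"""
import json
from pathlib import Path

config_path = Path("waveglow/config.json")
config = json.loads(config_path.read_text())

# WaveGlow trains directly on audio, so its filelist is one wav path per line.
config["data_config"]["training_files"] = "filelists/tagalog_audio_train.txt"
config["data_config"]["sampling_rate"] = 22050      # must match the Tagalog wavs
config["train_config"]["output_directory"] = "checkpoints/waveglow_tagalog"

config_path.write_text(json.dumps(config, indent=4))
```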
Checkpoints can be found here: checkpoints
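To sanity-check a tacotron2/waveglow checkpoint pair, synthesis can be run from the tacotron2 repo root along the lines of the repo's inference notebook. This is a minimal sketch: the checkpoint filenames and the Tagalog test sentence are placeholders, the english_cleaners setting is an assumption (it must match the cleaner used for training), and the notebook's denoiser step is omitted.

```python
"""Synthesize a Tagalog sentence from trained Tacotron 2 + WaveGlow checkpoints.

Minimal sketch following the tacotron2 repo's inference notebook; run it from
the repo root. Checkpoint paths, the test sentence, and 'english_cleaners'
are assumptions, not the project's actual settings.
"""
import numpy as np
import torch
from scipy.io.wavfile import write

from hparams import create_hparams
from train import load_model
from text import text_to_sequence

hparams = create_hparams()

# Tacotron 2 predicts mel spectrograms from character sequences.
tacotron2 = load_model(hparams)
tacotron2.load_state_dict(torch.load("checkpoints/tacotron2_tagalog.pt")["state_dict"])
tacotron2.cuda().eval()

# WaveGlow turns the predicted mel spectrogram into a waveform.
waveglow = torch.load("checkpoints/waveglow_tagalog.pt")["model"]
waveglow.cuda().eval()

text = "Magandang umaga po sa inyong lahat."
sequence = np.array(text_to_sequence(text, ["english_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

with torch.no_grad():
    _, mel_postnet, _, _ = tacotron2.inference(sequence)
    audio = waveglow.infer(mel_postnet, sigma=0.666)

write("sample.wav", hparams.sampling_rate, audio[0].cpu().numpy())
```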
Kobayashi's Sprocket was added to test whether applying voice conversion after the network would mitigate the grittiness of the output. As expected, results showed no improvement over the already poor performance, especially when tested with longer sentences.
Training was done by first generating the source voice with the network, while the target voice was taken from the dataset. Both source and target must speak the same words, and all target data must come from a single speaker. This can be prepared manually, or you can download some of the data we used here and paste it inside /sprocket/example/data/
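As an illustration, the snippet below arranges such parallel data into sprocket's example folder. It assumes sprocket's example layout of example/data/wav/&lt;speaker&gt;/&lt;utterance&gt;.wav and that the synthesized source wavs and the target recordings already share utterance IDs; the input folders and speaker labels (tts_src, tgt_spk) are placeholders, not the names used in our setup.

```python
"""Arrange parallel source/target wavs for Sprocket voice conversion.

Minimal sketch: assumes sprocket's example layout of
example/data/wav/<speaker>/<utterance>.wav and that source and target wavs
already share utterance IDs. Folder names and speaker labels are placeholders.
"""
import shutil
from pathlib import Path

SRC_DIR = Path("synthesized_wavs")        # utterances generated by the TTS network
TGT_DIR = Path("target_speaker_wavs")     # recordings of the single target speaker
DEST = Path("sprocket/example/data/wav")  # sprocket's expected data folder

for speaker in ("tts_src", "tgt_spk"):
    (DEST / speaker).mkdir(parents=True, exist_ok=True)

# Sprocket trains on parallel data, so keep only utterances that exist for
# both speakers (the same sentence spoken by source and target).
common_ids = sorted(
    {p.stem for p in SRC_DIR.glob("*.wav")} & {p.stem for p in TGT_DIR.glob("*.wav")}
)
for utt_id in common_ids:
    shutil.copy(SRC_DIR / f"{utt_id}.wav", DEST / "tts_src" / f"{utt_id}.wav")
    shutil.copy(TGT_DIR / f"{utt_id}.wav", DEST / "tgt_spk" / f"{utt_id}.wav")

print(f"Prepared {len(common_ids)} parallel utterance pairs.")
```

Whatever speaker labels are chosen here then have to match the ones given to sprocket's own setup and run scripts.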
For training and/or generation, please follow the steps here