A compilation of Text-to-Speech Synthesis projects
-

- NVIDIA's Tacotron 2
  [Paper] https://arxiv.org/pdf/1712.05884.pdf
  [Code] https://github.com/NVIDIA/tacotron2
- NVIDIA's OpenSeq2Seq
  [Documentation] https://nvidia.github.io/OpenSeq2Seq/
  [Code] https://github.com/NVIDIA/OpenSeq2Seq
- Deep Convolutional TTS
  [Paper] https://arxiv.org/pdf/1710.08969.pdf
  [Code] https://github.com/Kyubyong/dc_tts
  *Implemented by a third party, not by the paper's authors
- Google's Tacotron
  [Paper] https://arxiv.org/pdf/1703.10135.pdf
  [Code] https://github.com/keithito/tacotron
  [Code] https://github.com/MycroftAI/mimic2
  *TensorFlow implementations of Tacotron, not by the paper's authors
- Mozilla Text-to-Speech
  [Code] https://github.com/mozilla/TTS
- Stanford's GloVe
  [Documentation] https://nlp.stanford.edu/projects/glove/
  [Code] https://github.com/stanfordnlp/GloVe
- DeepMind's GAN-TTS
  [Paper] https://arxiv.org/pdf/1909.11646.pdf
  [Code] https://github.com/yanggeng1995/GAN-TTS
- Multi-Speaker Tacotron in TensorFlow
  [Code] https://github.com/carpedm20/multi-speaker-tacotron-tensorflow
- DeepVoice Series
  [DeepVoice 2] https://github.com/jdbermeol/deep_voice_2
  [DeepVoice 3] https://github.com/r9y9/deepvoice3_pytorch

** Most of the multi-speaker TTS repositories above are unofficial implementations
This project uses a combination of the existing works above, applied to the Tagalog language. NVIDIA's tacotron2 and waveglow provided the best results, despite those networks being optimized for single-speaker data while our Tagalog dataset is multi-speaker. A likely explanation is that, because tacotron2 trains at the per-character level, it learns speaker-independent features such as prosody well; the network captured this information but failed to model a consistent voice.
Training followed the same procedures as NVIDIA's repositories and Ryuichi Yamamoto's deepvoice3_pytorch. The data was edited and organised to match the expected inputs of each network, and the config files were changed to match the Tagalog dataset.
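As an illustration, the snippet below builds LJSpeech-style filelists (one "wav_path|transcript" line per utterance) in the form tacotron2 expects. It is a minimal sketch: the corpus layout it assumes (a wavs/ folder plus a tab-separated transcripts.txt) and the output filenames are placeholders, not the actual structure of our Tagalog dataset.

```python
"""Build LJSpeech-style filelists for Tacotron 2 from a Tagalog corpus.

Minimal sketch: the corpus layout assumed here (wavs/ plus a tab-separated
transcripts.txt of "<utt_id><TAB><text>" lines) and the output paths are
placeholders, not the actual structure of our dataset.
"""
import random
from pathlib import Path

CORPUS = Path("tagalog_corpus")           # hypothetical corpus root
WAV_DIR = CORPUS / "wavs"
TRANSCRIPTS = CORPUS / "transcripts.txt"  # hypothetical "<utt_id><TAB><text>" file

entries = []
for line in TRANSCRIPTS.read_text(encoding="utf-8").splitlines():
    utt_id, text = line.split("\t", 1)
    wav_path = WAV_DIR / f"{utt_id}.wav"
    if wav_path.exists():
        # Tacotron 2 filelists use one "wav_path|transcript" pair per line.
        entries.append(f"{wav_path}|{text.strip()}")

random.seed(0)
random.shuffle(entries)
n_val = max(1, len(entries) // 20)        # hold out roughly 5% for validation

out_dir = Path("filelists")
out_dir.mkdir(exist_ok=True)
(out_dir / "tagalog_train_filelist.txt").write_text("\n".join(entries[n_val:]), encoding="utf-8")
(out_dir / "tagalog_val_filelist.txt").write_text("\n".join(entries[:n_val]), encoding="utf-8")
```

For deepvoice3_pytorch, which consumes data through its preprocess.py front end, a similar script can instead emit an LJSpeech-style metadata.csv so the repo's existing LJSpeech preprocessor can be reused.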
Training tacotron2: python train.py --output_directory \[output dir] --log_directory \[log dir] -c \[optional, checkpoint file]
Training waveglow (in waveglow folder): python train.py -c config.json
Training deepvoice3 (in deepvoice3 folder): python train.py --data-root=\[data file] --preset=\[preset file] --checkpoint=\[optional, checkpoint file]
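Before running the waveglow command above, its config.json has to point at the Tagalog audio. The sketch below patches it programmatically; it assumes the stock config.json layout from the waveglow repo, and the filelist path, sampling rate, and output directory are project-specific placeholders. For tacotron2, the analogous edits are training_files, validation_files, and text_cleaners in hparams.py (or train.py's --hparams override).

```python
"""Point WaveGlow's config.json at the Tagalog data.

Minimal sketch assuming the stock config.json layout of the waveglow repo;
the filelist path, sampling rate, and output directory below are
project-specific placeholders.
"""
import json
from pathlib import Path

config_path = Path("waveglow/config.json")
config = json.loads(config_path.read_text())

# WaveGlow trains directly on audio, so its filelist is one wav path per line.
config["data_config"]["training_files"] = "filelists/tagalog_audio_train.txt"
config["data_config"]["sampling_rate"] = 22050      # must match the Tagalog wavs
config["train_config"]["output_directory"] = "checkpoints/waveglow_tagalog"

config_path.write_text(json.dumps(config, indent=4))
```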
Checkpoints can be found here: checkpoints
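To sanity-check a tacotron2/waveglow checkpoint pair, synthesis can be run from the tacotron2 repo root along the lines of the repo's inference notebook. This is a minimal sketch: the checkpoint filenames and the Tagalog test sentence are placeholders, the english_cleaners setting is an assumption (it must match the cleaner used for training), and the notebook's denoiser step is omitted.

```python
"""Synthesize a Tagalog sentence from trained Tacotron 2 + WaveGlow checkpoints.

Minimal sketch following the tacotron2 repo's inference notebook; run it from
the repo root. Checkpoint paths, the test sentence, and 'english_cleaners'
are assumptions, not the project's actual settings.
"""
import numpy as np
import torch
from scipy.io.wavfile import write

from hparams import create_hparams
from train import load_model
from text import text_to_sequence

hparams = create_hparams()

# Tacotron 2 predicts mel spectrograms from character sequences.
tacotron2 = load_model(hparams)
tacotron2.load_state_dict(torch.load("checkpoints/tacotron2_tagalog.pt")["state_dict"])
tacotron2.cuda().eval()

# WaveGlow turns the predicted mel spectrogram into a waveform.
waveglow = torch.load("checkpoints/waveglow_tagalog.pt")["model"]
waveglow.cuda().eval()

text = "Magandang umaga po sa inyong lahat."
sequence = np.array(text_to_sequence(text, ["english_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

with torch.no_grad():
    _, mel_postnet, _, _ = tacotron2.inference(sequence)
    audio = waveglow.infer(mel_postnet, sigma=0.666)

write("sample.wav", hparams.sampling_rate, audio[0].cpu().numpy())
```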
Kobayashi's Sprocket was added to test whether applying voice conversion after the network would mitigate the grittiness of the output. As expected, results showed no improvement over the already poor performance, especially when tested with longer sentences.
Training was done by first generating the source voice with the network, while the target voice was taken from the dataset. Both source and target must speak the same words, and all target data must come from a single speaker. This can be prepared manually, or you can download some of the data we used here and paste it inside /sprocket/example/data/
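As an illustration, the snippet below arranges such parallel data into sprocket's example folder. It assumes sprocket's example layout of example/data/wav/&lt;speaker&gt;/&lt;utterance&gt;.wav and that the synthesized source wavs and the target recordings already share utterance IDs; the input folders and speaker labels (tts_src, tgt_spk) are placeholders, not the names used in our setup.

```python
"""Arrange parallel source/target wavs for Sprocket voice conversion.

Minimal sketch: assumes sprocket's example layout of
example/data/wav/<speaker>/<utterance>.wav and that source and target wavs
already share utterance IDs. Folder names and speaker labels are placeholders.
"""
import shutil
from pathlib import Path

SRC_DIR = Path("synthesized_wavs")        # utterances generated by the TTS network
TGT_DIR = Path("target_speaker_wavs")     # recordings of the single target speaker
DEST = Path("sprocket/example/data/wav")  # sprocket's expected data folder

for speaker in ("tts_src", "tgt_spk"):
    (DEST / speaker).mkdir(parents=True, exist_ok=True)

# Sprocket trains on parallel data, so keep only utterances that exist for
# both speakers (the same sentence spoken by source and target).
common_ids = sorted(
    {p.stem for p in SRC_DIR.glob("*.wav")} & {p.stem for p in TGT_DIR.glob("*.wav")}
)
for utt_id in common_ids:
    shutil.copy(SRC_DIR / f"{utt_id}.wav", DEST / "tts_src" / f"{utt_id}.wav")
    shutil.copy(TGT_DIR / f"{utt_id}.wav", DEST / "tgt_spk" / f"{utt_id}.wav")

print(f"Prepared {len(common_ids)} parallel utterance pairs.")
```

Whatever speaker labels are chosen here then have to match the ones given to sprocket's own setup and run scripts.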
For training and/or generation, please follow the steps here