DCTTS is introduced in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention.
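The guided attention in the title constrains the attention matrix to stay near the diagonal during training. A minimal NumPy sketch of the penalty matrix from the paper, W[n, t] = 1 − exp(−(n/N − t/T)² / (2g²)) with g = 0.2 as in the paper (the function name is mine):

```python
import numpy as np

def guided_attention_weights(N, T, g=0.2):
    """Guided-attention penalty matrix from the DCTTS paper:
    W[n, t] = 1 - exp(-((n/N - t/T)**2) / (2 * g**2)).
    The penalty is zero on the diagonal and grows off-diagonal,
    pushing text position n and mel frame t to advance together."""
    n = np.arange(N).reshape(-1, 1) / N   # text positions, column vector
    t = np.arange(T).reshape(1, -1) / T   # mel frames, row vector
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))
```

During training this matrix is multiplied element-wise with the attention weights and the result is added to the loss, so off-diagonal attention is penalized.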
- NumPy >= 1.11.1
- TensorFlow >= 1.3 (note that the API of `tf.contrib.layers.layer_norm` has changed since 1.3)
- librosa
- tqdm
- matplotlib
- scipy
I train Portuguese models with the following steps:
- STEP 0. Download the TTS-Portuguese Corpus or prepare your own data.
- STEP 1. Run `python prepro.py`.
- STEP 2. Run `python train.py 1` to train Text2Mel.
- STEP 3. Run `python train.py 2` to train SSRN.

You can run STEPs 2 and 3 at the same time if you have more than one GPU card.
I generate speech samples based on phonetically balanced sentences, as the original paper does. They are already included in the repo.
- Run `synthesize.py` and check the files in `samples`.
| Dataset | Samples |
| :------------- | :------------- |
| TTS-Portuguese Corpus with Text | 2115k |
| TTS-Portuguese Corpus with Phoneme | 1734k |
A notebook intended to be run on https://colab.research.google.com is available:
- TTS-Portuguese Corpus with Text: Download this.
- TTS-Portuguese Corpus with Phoneme: Download this.
- The changes not described in the paper were inspired by the repository: dc_tts
- The paper didn't mention normalization, but without normalization I couldn't get the model to work, so I added layer normalization.
- The paper didn't mention dropout, so I added a dropout rate of 0.05 for all layers.
- The paper fixed the learning rate at 0.001, but that didn't work for me, so I decayed it.
- This implementation is inspired by the repository: dc_tts
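The layer-normalization and dropout tweaks above can be sketched in plain NumPy (a simplified illustration, not this repo's TensorFlow code; the function names and the inverted-dropout formulation are mine):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each feature vector (last axis) to zero mean and unit
    # variance; the real model would also apply a learnable gain and bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dropout(x, rate=0.05, rng=None, training=True):
    # Inverted dropout with the 0.05 rate used for all layers: zero out a
    # random 5% of activations and rescale the rest so the expected value
    # is unchanged. At inference time the input passes through untouched.
    if not training:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```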
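The notes above don't state which decay schedule replaced the fixed 0.001. As one illustrative possibility, an exponential step decay starting from the paper's 0.001 could look like this (the decay rate and step count are assumptions, not this repo's actual values):

```python
def decayed_lr(step, init_lr=1e-3, decay_rate=0.5, decay_steps=100000):
    # Halve the learning rate every decay_steps training steps,
    # starting from the paper's fixed 0.001.
    return init_lr * decay_rate ** (step / decay_steps)
```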