- python 3.6
- pytorch 1.0
- librosa, scipy, tqdm, tensorboardX
- LJ Speech 1.1, a single-speaker (female) dataset.
- I follow Kyubyong's DCTTS repo (TensorFlow) for preprocessing the speech signal data. It worked well in practice.
-
Download the dataset above and set its path in config.py. Then run the command below. The first argument controls signal preprocessing and the second controls metadata generation (train/test split).
python prepro.py 1 1
-
The model needs to train for 100k+ steps (10+ hours).
python train.py
-
After training, you can synthesize some speech from text.
python synthesize.py
- In speech synthesis, the attention module is important. If the model is trained properly, you should see monotonic attention, as in the following figures.
- I used bilinear attention instead of MLP attention in the model.
- I adjusted some momentum values to stabilize the model. This alleviates overfitting.
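
For readers unfamiliar with the two attention variants mentioned above, here is a minimal sketch of how the scoring functions differ. It assumes "bilinear" means the common `q^T W k` form and "MLP" means the additive `v^T tanh(W_q q + W_k k)` form; the repo's actual layers may differ. All names (`bilinear_score`, `mlp_score`, `attend`) are hypothetical, and the code uses plain Python lists rather than PyTorch tensors so it runs without extra dependencies.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_score(query, key, W):
    """Bilinear score q^T W k (assumed form; W has shape len(query) x len(key))."""
    return sum(
        query[i] * W[i][j] * key[j]
        for i in range(len(query))
        for j in range(len(key))
    )


def mlp_score(query, key, W_q, W_k, v):
    """Additive (MLP) score v^T tanh(W_q q + W_k k), shown for comparison."""
    hidden = [
        math.tanh(
            sum(W_q[h][i] * query[i] for i in range(len(query)))
            + sum(W_k[h][j] * key[j] for j in range(len(key)))
        )
        for h in range(len(v))
    ]
    return sum(v[h] * hidden[h] for h in range(len(v)))


def attend(query, keys, score_fn):
    """Attention weights of one decoder query over all encoder keys."""
    return softmax([score_fn(query, k) for k in keys])


# Toy example: with W set to the identity, the bilinear score reduces to a dot product.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
weights = attend(query, keys, lambda q, k: bilinear_score(q, k, W))
```

In this toy case the first and third keys receive equal weight and the second receives less, since only the second key is orthogonal to the query. The bilinear form needs a single weight matrix and no nonlinearity, which is one plausible reason to prefer it over the MLP form.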