VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
- This repository implements a VITS-based zero-shot TTS system with various style/speaker conditioning methods.
- To keep the comparison clean, we remove secondary elements and simply extract a style representation with a reference encoder from StyleSpeech, trained jointly with the rest of the model (see the sketch after this list). Specifically: 1. we do not use pretrained models (e.g., Link1, Link2) as the reference encoder; 2. we do not apply meta-learning or a speaker verification loss during training.
- The LibriTTS dataset (train-clean-100 and train-clean-360) is used for training.
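As a concrete illustration of such a jointly trained reference encoder, here is a minimal PyTorch sketch in the spirit of StyleSpeech's mel-style encoder (per-frame spectral processing, temporal convolutions, self-attention, and average pooling over time). All module names, layer counts, and dimensions are illustrative assumptions, not this repository's actual code:

```python
import torch
import torch.nn as nn

class MelStyleEncoder(nn.Module):
    """Sketch of a reference encoder: mel-spectrogram -> fixed-size style vector.

    Illustrative only; hidden sizes and layer counts are assumptions.
    """
    def __init__(self, n_mels: int = 80, hidden: int = 128, style_dim: int = 256):
        super().__init__()
        # Per-frame (spectral-axis) processing
        self.spectral = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
        )
        # Local temporal context
        self.temporal = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.Mish(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.Mish(),
        )
        self.attn = nn.MultiheadAttention(hidden, num_heads=2, batch_first=True)
        self.out = nn.Linear(hidden, style_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels)
        x = self.spectral(mel)
        x = self.temporal(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.attn(x, x, x)
        # Temporal average pooling -> (batch, style_dim)
        return self.out(x.mean(dim=1))
```

For example, `MelStyleEncoder()(torch.randn(2, 200, 80))` returns a `(2, 256)` style vector, which is then injected into the modules listed in the table below.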
| Model | Text Encoder | Flow | Posterior Encoder | Vocoder |
|---|---|---|---|---|
| master (YourTTS) | Output addition | Global conditioning | Global conditioning | Input addition |
| transfer (TransferTTS) | None | Global conditioning | None | None |
| s1 (Proposed) | SC-CNN | Global conditioning | Global conditioning | Input addition |
| s2 (Proposed) | SC-CNN | SC-CNN | SC-CNN | TBD |
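The table names two conditioning mechanisms: "Global conditioning" broadcasts a projected style vector over time and adds it to hidden activations (output/input addition apply the same idea at a module's output or input), while SC-CNN predicts convolution kernels from the style vector so the convolution itself becomes speaker-dependent. A minimal sketch of both, with all names and shapes as assumptions rather than this repository's actual layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalConditioning(nn.Module):
    """Add a projected style vector to every time step of a hidden sequence."""
    def __init__(self, style_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(style_dim, channels)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), style: (batch, style_dim)
        return x + self.proj(style).unsqueeze(-1)

class SCCNN(nn.Module):
    """Style-conditioned convolution: conv kernels are predicted per utterance."""
    def __init__(self, style_dim: int, channels: int, kernel_size: int = 3):
        super().__init__()
        self.channels, self.kernel_size = channels, kernel_size
        self.kernel_pred = nn.Linear(style_dim, channels * channels * kernel_size)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), style: (batch, style_dim)
        batch = x.size(0)
        w = self.kernel_pred(style).view(
            batch * self.channels, self.channels, self.kernel_size
        )
        # Grouped conv applies each utterance's own kernel to its own sequence.
        x = x.reshape(1, batch * self.channels, -1)
        x = F.conv1d(x, w, padding=self.kernel_size // 2, groups=batch)
        return x.view(batch, self.channels, -1)
```

The grouped-convolution trick applies each utterance's predicted kernel to its own sequence in a single batched call, avoiding a Python loop over the batch.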
Each variant is implemented on its own branch:
- master
- transfer
- s1
- s2
- Python >= 3.6
- Clone this repository
- Install python requirements. Please refer to requirements.txt.
- You may need to install espeak first:
  ```sh
  apt-get install espeak
  ```
- Download datasets
- Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
```sh
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
```
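For custom datasets, upstream VITS ships a g2p preprocessing script; assuming this fork keeps the same interface (the flag names follow upstream VITS's preprocess.py, and the filelist paths below are placeholders to replace with your own), the call looks like:

```sh
# g2p preprocessing for your own filelists; --text_index selects the text column.
# Flags follow upstream VITS's preprocess.py and may differ in this fork.
python preprocess.py --text_index 1 --filelists filelists/train_filelist.txt filelists/val_filelist.txt
```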
- Train the model:
```sh
python train_zs.py -c configs/libritts_base.json -m libritts_base
```
- See inference.ipynb for inference examples.
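For orientation, here is a rough Python outline of what such a notebook typically does, following upstream VITS's inference API. The checkpoint path is a placeholder, and the zero-shot branches additionally pass a reference utterance to the style encoder, so check inference.ipynb for the exact `infer()` signature in each branch:

```python
import torch
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

# Load hyperparameters and build the generator (upstream VITS API; verify against this fork).
hps = utils.get_hparams_from_file("configs/libritts_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model,
).cuda().eval()
utils.load_checkpoint("logs/libritts_base/G_0.pth", net_g, None)  # placeholder checkpoint path

# Convert text to a phoneme-ID sequence using the cleaners from the config.
seq = torch.LongTensor(text_to_sequence("Hello world.", hps.data.text_cleaners))
with torch.no_grad():
    x = seq.unsqueeze(0).cuda()
    x_lengths = torch.LongTensor([seq.size(0)]).cuda()
    # Zero-shot branches also take a reference mel for the style encoder here.
    audio = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.0)[0][0, 0].cpu().numpy()
```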