VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
- This repository implements a VITS-based zero-shot TTS system with various style/speaker conditioning methods.
- To keep the comparison clean, we remove secondary elements and simply extract a style representation with a reference encoder from StyleSpeech, trained jointly with the rest of the model (see the sketch after this list). Specifically: 1. we do not use pretrained models (e.g., Link1, Link2) as the reference encoder; 2. we do not apply meta-learning or a speaker verification loss during training.
- The LibriTTS dataset (train-clean-100 and train-clean-360) is used for training.
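As a concrete illustration of such a jointly trained reference encoder, here is a minimal PyTorch sketch in the spirit of StyleSpeech's mel-style encoder (per-frame spectral processing, temporal convolutions, self-attention, and average pooling over time). All module names, layer counts, and dimensions are illustrative assumptions, not this repository's actual code:

```python
import torch
import torch.nn as nn

class MelStyleEncoder(nn.Module):
    """Sketch of a reference encoder: mel-spectrogram -> fixed-size style vector.

    Illustrative only; hidden sizes and layer counts are assumptions.
    """
    def __init__(self, n_mels: int = 80, hidden: int = 128, style_dim: int = 256):
        super().__init__()
        # Per-frame (spectral-axis) processing
        self.spectral = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
        )
        # Local temporal context
        self.temporal = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.Mish(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.Mish(),
        )
        self.attn = nn.MultiheadAttention(hidden, num_heads=2, batch_first=True)
        self.out = nn.Linear(hidden, style_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels)
        x = self.spectral(mel)
        x = self.temporal(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.attn(x, x, x)
        # Temporal average pooling -> (batch, style_dim)
        return self.out(x.mean(dim=1))
```

For example, `MelStyleEncoder()(torch.randn(2, 200, 80))` returns a `(2, 256)` style vector, which is then injected into the modules listed in the table below.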
| Model | Text Encoder | Flow | Posterior Encoder | Vocoder |
|---|---|---|---|---|
| master (YourTTS) | Output addition | Global conditioning | Global conditioning | Input addition |
| transfer (TransferTTS) | None | Global conditioning | None | None |
| s1 (Proposed) | SC-CNN | Global conditioning | Global conditioning | Input addition |
| s2 (Proposed) | SC-CNN | SC-CNN | SC-CNN | TBD |
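The table names two conditioning mechanisms: "Global conditioning" broadcasts a projected style vector over time and adds it to hidden activations (output/input addition apply the same idea at a module's output or input), while SC-CNN predicts convolution kernels from the style vector so the convolution itself becomes speaker-dependent. A minimal sketch of both, with all names and shapes as assumptions rather than this repository's actual layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalConditioning(nn.Module):
    """Add a projected style vector to every time step of a hidden sequence."""
    def __init__(self, style_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(style_dim, channels)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), style: (batch, style_dim)
        return x + self.proj(style).unsqueeze(-1)

class SCCNN(nn.Module):
    """Style-conditioned convolution: conv kernels are predicted per utterance."""
    def __init__(self, style_dim: int, channels: int, kernel_size: int = 3):
        super().__init__()
        self.channels, self.kernel_size = channels, kernel_size
        self.kernel_pred = nn.Linear(style_dim, channels * channels * kernel_size)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), style: (batch, style_dim)
        batch = x.size(0)
        w = self.kernel_pred(style).view(
            batch * self.channels, self.channels, self.kernel_size
        )
        # Grouped conv applies each utterance's own kernel to its own sequence.
        x = x.reshape(1, batch * self.channels, -1)
        x = F.conv1d(x, w, padding=self.kernel_size // 2, groups=batch)
        return x.view(batch, self.channels, -1)
```

The grouped-convolution trick applies each utterance's predicted kernel to its own sequence in a single batched call, avoiding a Python loop over the batch.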
Each variant is implemented on its own branch:
- master
- transfer
- s1
- s2
- Python >= 3.6
- Clone this repository
- Install python requirements. Please refer to requirements.txt.
- You may need to install espeak first:
  ```sh
  apt-get install espeak
  ```
- Download datasets
- Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
```sh
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
```
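For custom datasets, upstream VITS ships a g2p preprocessing script; assuming this fork keeps the same interface (the flag names follow upstream VITS's preprocess.py, and the filelist paths below are placeholders to replace with your own), the call looks like:

```sh
# g2p preprocessing for your own filelists; --text_index selects the text column.
# Flags follow upstream VITS's preprocess.py and may differ in this fork.
python preprocess.py --text_index 1 --filelists filelists/train_filelist.txt filelists/val_filelist.txt
```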
- Train the model:
```sh
python train_zs.py -c configs/libritts_base.json -m libritts_base
```
- See inference.ipynb for inference examples.
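For orientation, here is a rough Python outline of what such a notebook typically does, following upstream VITS's inference API. The checkpoint path is a placeholder, and the zero-shot branches additionally pass a reference utterance to the style encoder, so check inference.ipynb for the exact `infer()` signature in each branch:

```python
import torch
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

# Load hyperparameters and build the generator (upstream VITS API; verify against this fork).
hps = utils.get_hparams_from_file("configs/libritts_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model,
).cuda().eval()
utils.load_checkpoint("logs/libritts_base/G_0.pth", net_g, None)  # placeholder checkpoint path

# Convert text to a phoneme-ID sequence using the cleaners from the config.
seq = torch.LongTensor(text_to_sequence("Hello world.", hps.data.text_cleaners))
with torch.no_grad():
    x = seq.unsqueeze(0).cuda()
    x_lengths = torch.LongTensor([seq.size(0)]).cuda()
    # Zero-shot branches also take a reference mel for the style encoder here.
    audio = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.0)[0][0, 0].cpu().numpy()
```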