UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

Overview

This project is an unofficial implementation of the UniCATS paper.

Because the official implementation of the CTX-vec2wav model has been released, this repository follows the same setup. This keeps the project consistent and compatible with future updates.

Note: Please refer to the official implementations of CTX-text2vec and CTX-vec2wav.

Setup

To get started, run the following from the repository's root directory:

pip install -e .

Dataset

This project uses the LibriTTS dataset at a 24 kHz sampling rate. To reproduce the dataset splits used in the paper, please follow the steps in this guide.
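
As a quick sanity check on the 24 kHz requirement, the sketch below scans a directory for .wav files whose sampling rate differs from the expected one. It is not part of this repository: the function name `find_mismatched_wavs` and the directory path in the usage note are hypothetical, and it assumes the files are plain PCM WAV, which is what Python's standard `wave` module can read.

```python
import wave
from pathlib import Path

EXPECTED_SAMPLE_RATE = 24_000  # 24 kHz, as stated in this README


def find_mismatched_wavs(root, expected_rate=EXPECTED_SAMPLE_RATE):
    """Return (path, rate) pairs for .wav files whose rate differs from expected_rate.

    Hypothetical helper for illustration; assumes PCM WAV files.
    """
    mismatched = []
    for wav_path in sorted(Path(root).rglob("*.wav")):
        # Only the header is needed to read the sampling rate.
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatched.append((wav_path, rate))
    return mismatched
```

For example, `find_mismatched_wavs("path/to/LibriTTS")` (a placeholder path) returns an empty list when every file found is at 24 kHz.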

Credits

VQ Diffusion by Microsoft
UniCATS-CTX-vec2wav by cantabile-kwok

Citation

@article{du2023unicats,
  title={UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding},
  author={Du, Chenpeng and Guo, Yiwei and Shen, Feiyu and Liu, Zhijun and Liang, Zheng and Chen, Xie and Wang, Shuai and Zhang, Hui and Yu, Kai},
  journal={arXiv preprint arXiv:2306.07547},
  year={2023}
}