UnsupTTS is an unsupervised text-to-speech (TTS) system learned from unpaired speech and text data.
If you find this project useful, please consider citing our paper.
```bibtex
@inproceedings{Ni-etal-2022-unsup-tts,
  author={Junrui Ni and Liming Wang and Heting Gao and Kaizhi Qian and Yang Zhang and Shiyu Chang and Mark Hasegawa-Johnson},
  title={Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition},
  booktitle={arXiv},
  year={2022},
  url={https://arxiv.org/pdf/2203.15796.pdf}
}
```
Speech samples can be found here.
- fairseq >= 1.0.0, with the dependencies for wav2vec-U
- ESPnet, pinned to commit `010f483e7661019761b169563ee622464125e56f` or earlier
- ParallelWaveGAN
- LanguageNet G2Ps (only for models using phoneme transcripts)
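One possible way to set up the dependencies above is sketched below; the repository URLs and the `parallel_wavegan` pip package name are assumptions based on the upstream projects, not part of this repo.

```shell
# Sketch of a dependency setup (repo URLs and package names assumed).
git clone https://github.com/pytorch/fairseq.git
pip install --editable ./fairseq          # fairseq >= 1.0.0 (wav2vec-U lives in examples/)

git clone https://github.com/espnet/espnet.git
git -C espnet checkout 010f483e7661019761b169563ee622464125e56f  # pinned commit

pip install parallel_wavegan              # ParallelWaveGAN vocoder

git clone https://github.com/uiuc-sst/g2ps.git  # LanguageNet G2Ps (phoneme transcripts only)
```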
- Download the LJSpeech and CSS10 datasets; modify the paths and settings in `source_code/unsupervised/run_css10_cpy2.slurm` and `tts1/css10_nl/run.sh`. The default language is Dutch (nl) with phoneme transcripts; change the `$lang` variable to switch the language and the `$trans_type` variable to switch the transcript type.
- Run `bash run_css10_cpy2.slurm`.
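As a sketch, the two variables to edit inside `run_css10_cpy2.slurm` would look like this (the surrounding script contents and the exact variable placement are assumptions; the language and transcript-type codes follow the tables below):

```shell
# Inside run_css10_cpy2.slurm (sketch; surrounding script content assumed).
lang=hu          # CSS10 language code: ja, hu, nl, fi, es, or de
trans_type=phn   # transcript type: "char" (characters) or "phn" (phonemes)
echo "training unsupervised TTS for ${lang} (${trans_type} transcripts)"
```

After editing, launch with `bash run_css10_cpy2.slurm` as above.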
| LJSpeech | ASR | TTS |
|---|---|---|
| en | link | link |
| CSS10 | Unit | ASR | TTS |
|---|---|---|---|
| ja | char | link | link |
| hu | char | link | link |
| nl | char | link | link |
| fi | char | link | link |
| es | char | link | link |
| de | char | link | link |
| hu | phn | link | link |
| nl | phn | link | link |
| fi | phn | link | link |