CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models
Generated samples are accessible through the following link: Research showcase
In the instructions below, DATASET is a placeholder for a dataset name such as LibriTTS or VCTK.
You can install the Python dependencies with:
pip3 install -r requirements.txt
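Optionally, you can keep the dependencies isolated in a standard virtual environment first (not required by the repository):

```bash
# Create and activate an isolated environment, then install.
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
```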
You have to download the pretrained models (to avoid compromising anonymity, we will share the link later) and put them in output/pretrained_model/DATASET/CMDenoiserTTS/.
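As a rough sketch, assuming the models are distributed as an archive (the archive name here is hypothetical), placement for LibriTTS could look like:

```bash
# Create the expected checkpoint directory (substitute your dataset
# name for LibriTTS).
mkdir -p output/pretrained_model/LibriTTS/CMDenoiserTTS/

# Unpack the downloaded archive into it; the archive name is a
# placeholder for whatever file the released link provides.
tar -xzf cm_tts_pretrained.tar.gz -C output/pretrained_model/LibriTTS/CMDenoiserTTS/
```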
For single-sample synthesis, run the script for the corresponding dataset.

Synthesize on VCTK.
bash single_synthesize_vctk.sh
Synthesize on LJSpeech.
bash single_synthesize_lj.sh
Synthesize on LibriTTS.
bash single_synthesize_lib.sh
For batch synthesis, use the following scripts instead.

Synthesize on VCTK.
bash synthesize_vctk.sh
Synthesize on LJSpeech.
bash synthesize_lj.sh
Synthesize on LibriTTS.
bash synthesize_lib.sh
You can perform zero-shot synthesis across datasets as follows:
Train on the LibriTTS dataset and predict on VCTK.
bash synthesize_lib2vctk.sh
Train on the LibriTTS dataset and predict on LJSpeech.
bash synthesize_lib2lj.sh
For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.
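A minimal placement sketch, assuming the downloaded checkpoint is named ResCNN_triplet_training_checkpoint_265.h5 (match this to the actual filename in the DeepSpeaker release):

```bash
# Create the directory the code expects for the speaker embedder.
mkdir -p ./deepspeaker/pretrained_models/

# Move the DeepSpeaker checkpoint into place; the filename is an
# assumption -- use the name of the file you actually downloaded.
mv ResCNN_triplet_training_checkpoint_265.h5 ./deepspeaker/pretrained_models/
```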
For forced alignment, the Montreal Forced Aligner (MFA) is used to obtain alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.
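Both routes, sketched in bash; the archive name, dictionary, and acoustic-model paths below are placeholders, not files guaranteed to exist under those names:

```bash
# Route 1: unpack the provided pre-extracted alignments
# (archive name assumed; adjust DATASET accordingly).
unzip TextGrid.zip -d preprocessed_data/LibriTTS/TextGrid/

# Route 2: run MFA yourself. `mfa align` expects a corpus directory,
# a pronunciation dictionary, an acoustic model, and an output
# directory; the paths here are illustrative only.
mfa align raw_data/LibriTTS/ lexicon/librispeech-lexicon.txt english.zip preprocessed_data/LibriTTS/TextGrid/
```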
Before starting the processing, please check that ./config/DATASET/preprocess.yaml has been configured according to your preferences.
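For orientation, preprocess.yaml files in FastSpeech2-style codebases typically contain fields like the ones sketched below; the exact keys in this repository may differ, so treat this purely as an illustration of what to check:

```yaml
dataset: "LibriTTS"                                 # corpus this config describes
path:
  raw_path: "./raw_data/LibriTTS"                   # location of the raw corpus
  preprocessed_path: "./preprocessed_data/LibriTTS" # where processed features go
preprocessing:
  audio:
    sampling_rate: 22050                            # target sampling rate
  mel:
    n_mel_channels: 80                              # mel-spectrogram channels
```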
After completing the above preparations, you can process the data by running the script for the corresponding dataset:
LJSpeech:
bash deal_data_Lj.sh
VCTK:
bash deal_data_VCTK.sh
LibriTTS:
bash deal_data_Lib.sh
Before starting training, please ensure that all the configurations under ./config/DATASET have been set according to your preferences and that the data has been processed as described above.

You can then train on each dataset:
LJSpeech:
python3 train_cm.py --model consistency_training --dataset LJSpeech
VCTK:
python3 train_cm.py --model consistency_training --dataset VCTK
LibriTTS:
python3 train_cm.py --model consistency_training --dataset LibriTTS
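If you want to train all three configurations sequentially, a simple shell loop over the dataset names works; this is a convenience sketch, not a script shipped with the repository:

```bash
# Train CM-TTS on each dataset in turn, using the same flags as the
# individual commands above.
for ds in LJSpeech VCTK LibriTTS; do
    python3 train_cm.py --model consistency_training --dataset "$ds"
done
```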
Experiment Description for MOS: In this supplementary experiment, we add MOS test results for all models on the VCTK and LJSpeech datasets to Tables 1 and 2. For Tables 3 and 4, we add MOS results for the CM-TTS and DiffGAN-TTS models in a zero-shot scenario. For Table 5, we add MOS results for CM-TTS under different sampler settings, and for Table 6, under different loss settings. The specific results are as follows.
MOS results added to Table 1 (VCTK):

Models | MOS |
---|---|
Reference (voc.) | 4.5826(±0.1147) |
FastSpeech2(300K) | 3.6821(±0.1762) |
VITS | 3.6717(±0.0123) |
DiffSpeech | 2.9157(±0.0594) |
DiffGAN-TTS(T=1) | 3.4476(±0.1038) |
DiffGAN-TTS(T=2) | 3.6173(±0.1433) |
DiffGAN-TTS(T=4) | 3.6143(±0.1186) |
CM-TTS(T=1) | 3.9618(±0.0186) |
CM-TTS(T=2) | 3.8947(±0.0262) |
CM-TTS(T=4) | 3.8623(±0.0311) |
MOS results added to Table 2 (LJSpeech):

Models | MOS |
---|---|
Reference (voc.) | 4.8667(±0.0315) |
FastSpeech2(300K) | 3.5742(±0.2309) |
DiffSpeech | 3.1668(±0.1378) |
CoMoSpeech | 3.5583(±0.2421) |
VITS | 3.6234(±0.0252) |
DiffGAN-TTS(T=1) | 3.7142(±0.1390) |
DiffGAN-TTS(T=2) | 3.6813(±0.0561) |
DiffGAN-TTS(T=4) | 3.7258(±0.0087) |
CM-TTS(T=1) | 3.8353(±0.0179) |
CM-TTS(T=2) | 3.7917(±0.1356) |
CM-TTS(T=4) | 3.7602(±0.1327) |
MOS results added to Table 3 (zero-shot):

Models | MOS |
---|---|
Reference (voc.) | 4.7467(±0.0194) |
DiffGAN-TTS(T=1) | 3.4607(±0.1880) |
DiffGAN-TTS(T=2) | 3.5067(±0.1573) |
DiffGAN-TTS(T=4) | 3.5893(±0.0298) |
CM-TTS(T=1) | 3.8715(±0.0896) |
CM-TTS(T=2) | 3.8387(±0.1521) |
CM-TTS(T=4) | 3.9221(±0.1016) |
MOS results added to Table 4 (zero-shot):

Models | MOS |
---|---|
Reference (voc.) | 4.8832(±0.0174) |
DiffGAN-TTS(T=1) | 3.6047(±0.1015) |
DiffGAN-TTS(T=2) | 3.6212(±0.0771) |
DiffGAN-TTS(T=4) | 3.7361(±0.1802) |
CM-TTS(T=1) | 3.7205(±0.1097) |
CM-TTS(T=2) | 3.6817(±0.1328) |
CM-TTS(T=4) | 3.7113(±0.1022) |
MOS results added to Table 5 (sampler settings):

Sampler | MOS |
---|---|
Reference (voc.) | 4.7172(±0.1236) |
Uniform | 3.8133(±0.0727) |
Linear(↗) | 3.3278(±0.0803) |
Linear(↘) | 3.5676(±0.1488) |
LSM | 3.9107(±0.1254) |
MOS results added to Table 6 (loss settings):

Loss | MOS |
---|---|
Reference (voc.) | 4.6304(±0.1418) |
L1 | 3.9052(±0.0415) |
L1 (w/o padding) | 3.8117(±0.1005) |
L2 | 3.8726(±0.1971) |
L2 (w/o padding) | 3.8604(±0.1436) |
Earlier implementations of the FastSpeech 2 baseline loaded checkpoints directly, which could cause loading errors. We have now retrained the model and re-evaluated the relevant metrics; after careful re-checking, the updated numbers are available in the table below.
Dataset | FFE(↓) | Cos-speaker(↑) | mfccFID(↓) | melFID(↓) | mfccRecall(↑) | MCD(↓) | SSIM(↑) | mfccCOS(↑) | F0-RMSE(↓) | WER(↓) |
---|---|---|---|---|---|---|---|---|---|---|
FastSpeech2(VCTK) | 0.3503 | 0.8236 | 43.4236 | 8.8175 | 0.3554 | 5.8897 | 0.4537 | 0.7565 | 119.2076 | 0.0677 |
FastSpeech2(LJSpeech) | 0.4877 | 0.8825 | 36.3090 | 5.2796 | 0.2121 | 6.1157 | 0.6468 | 0.7985 | 135.2583 | 0.0944 |
To verify the individual contributions of CT (consistency training) and LSM (the weighted sampler) to the model's performance, we conducted ablation experiments that remove each component separately. The results are presented below.
Models | FFE(↓) | Cos-speaker(↑) | mfccFID(↓) | melFID(↓) | mfccRecall(↑) | MCD(↓) | SSIM(↑) | mfccCOS(↑) | F0-RMSE(↓) | WER(↓) |
---|---|---|---|---|---|---|---|---|---|---|
CM-TTS(T=1) | 0.3387 | 0.8396 | 39.17 | 7.58 | 0.3946 | 5.91 | 0.4772 | 0.7599 | 119.29 | 0.0688 |
w/o CT | 0.3364 | 0.8351 | 43.1316 | 10.7424 | 0.4010 | 5.9821 | 0.4626 | 0.7545 | 122.6910 | 0.0832 |
w/o LSM | 0.3351 | 0.8333 | 56.31 | 10.08 | 0.4015 | 5.98 | 0.4396 | 0.7456 | 118.87 | 0.0872 |
To further explore the generalization of LSM, we apply it to DiffGAN-TTS. The results in the following table show that LSM brings improvements on most metrics.
Model | FFE(↓) | Cos-speaker(↑) | mfccFID(↓) | melFID(↓) | mfccRecall(↑) | MCD(↓) | SSIM(↑) | mfccCOS(↑) | F0-RMSE(↓) | WER(↓) |
---|---|---|---|---|---|---|---|---|---|---|
Ground truth | 0.1427 | 0.9424 | 31.9789 | 3.4802 | 0.5644 | 4.5670 | 0.8132 | 0.8457 | 89.2136 | 0.0412 |
DiffGAN-TTS(T=2) | 0.3411 | 0.8333 | 38.6428 | 7.7855 | 0.3974 | 5.9437 | 0.4610 | 0.7581 | 117.1919 | 0.0827 |
+LSM | 0.3397 | 0.8397 | 42.9622 | 7.9161 | 0.3990 | 5.8576 | 0.4580 | 0.7582 | 115.3769 | 0.0720 |
DiffGAN-TTS(T=4) | 0.3465 | 0.8358 | 37.1099 | 6.5823 | 0.3662 | 5.9425 | 0.4614 | 0.7571 | 120.0975 | 0.0751 |
+LSM | 0.3405 | 0.8403 | 43.8128 | 7.8876 | 0.3870 | 5.8742 | 0.4641 | 0.7590 | 115.8887 | 0.0704 |