CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

This repository contains the source code of our Findings of NAACL 2024 paper.

Generated samples are accessible through the following link: Research showcase

Quickstart


In the following instructions, DATASET refers to a dataset name such as LJSpeech, VCTK, or LibriTTS.


Dependencies

You can install the Python dependencies with:

pip3 install -r requirements.txt


Synthesize

You have to download the pretrained models (to avoid breaking anonymity, we will share the link later) and put them in output/pretrained_model/DATASET/CMDenoiserTTS/.
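For example, assuming a downloaded checkpoint file (the filename below is a placeholder), the expected layout can be created with:

mkdir -p output/pretrained_model/DATASET/CMDenoiserTTS/

cp /path/to/downloaded_checkpoint output/pretrained_model/DATASET/CMDenoiserTTS/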

Synthesize by Single Text

Synthesize on VCTK. bash single_synthesize_vctk.sh

Synthesize on LJSpeech. bash single_synthesize_lj.sh

Synthesize on LibriTTS. bash single_synthesize_lib.sh
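Each of these wrappers is a small shell script around the repository's Python synthesis entry point. The line below is only a hypothetical illustration of what such a script contains (the actual script name, flags, and checkpoint step may differ), so please check the corresponding .sh file for the exact command:

python3 synthesize.py --mode single --dataset VCTK --text "Hello world" --restore_step 300000  # hypothetical flags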

Synthesize by Single Batch

Synthesize on VCTK. bash synthesize_vctk.sh

Synthesize on LJSpeech. bash synthesize_lj.sh

Synthesize on LibriTTS. bash synthesize_lib.sh

Synthesize Zero-Shot Samples

You can achieve zero-shot synthesis across datasets using the following approach:

Train on the LibriTTS dataset and predict on VCTK. bash synthesize_lib2vctk.sh

Train on the LibriTTS dataset and predict on LJSpeech. bash synthesize_lib2lj.sh

Training


Preprocessing Data

For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model from philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.
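If you want to sanity-check the embedder on its own, the upstream deep-speaker repository documents usage along the following lines; treat the import paths and the checkpoint filename as assumptions, and note that the CM-TTS preprocessing scripts call the bundled copy in ./deepspeaker/ for you:

```python
import numpy as np
from deep_speaker.audio import read_mfcc
from deep_speaker.batcher import sample_from_mfcc
from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
from deep_speaker.conv_models import DeepSpeakerModel

# Build the ResCNN model and load the Softmax+Triplet checkpoint
# (checkpoint filename is an assumption; use the file you downloaded).
model = DeepSpeakerModel()
model.m.load_weights('./deepspeaker/pretrained_models/ResCNN_triplet_training_checkpoint_265.h5',
                     by_name=True)

# Extract a fixed-length MFCC window from a wav file and predict a speaker embedding.
mfcc = sample_from_mfcc(read_mfcc('/path/to/sample.wav', SAMPLE_RATE), NUM_FRAMES)
embedding = model.m.predict(np.expand_dims(mfcc, axis=0))
print(embedding.shape)  # (1, 512)
```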

For forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself; an example invocation is shown below.
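A typical alignment command looks like the following (the corpus, lexicon, and acoustic-model paths are placeholders, and the exact argument order can differ between MFA versions):

mfa align /path/to/DATASET_corpus /path/to/lexicon.dict /path/to/acoustic_model.zip preprocessed_data/DATASET/TextGrid/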

Before starting preprocessing, please check that ./config/DATASET/preprocess.yaml is configured according to your preferences.

After completing the above preparations, you can preprocess the data by running the corresponding script:

LJSpeech: bash deal_data_Lj.sh

VCTK: bash deal_data_VCTK.sh

LibriTTS: bash deal_data_Lib.sh


Start Training

Before starting training, please ensure that all configurations under ./config/DATASET are set according to your preferences and that the data has been processed as described above.

You can perform training on different datasets.

LJSpeech: python3 train_cm.py --model consistency_training --dataset LJSpeech

VCTK: python3 train_cm.py --model consistency_training --dataset VCTK

LibriTTS: python3 train_cm.py --model consistency_training --dataset LibriTTS
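For readers who want a feel for what consistency training with a weighted timestep sampler means, the sketch below shows one conceptual training step in PyTorch. It is not the repository's implementation; every name in it (denoiser, ema_denoiser, cond, sigmas, weights) is a hypothetical placeholder, and the loss is simplified to an MSE between the online and EMA networks at adjacent noise levels.

```python
# Conceptual sketch only -- NOT the CM-TTS implementation.
import torch
import torch.nn.functional as F

def consistency_training_step(denoiser, ema_denoiser, mel, cond, sigmas, weights):
    """One consistency-training step with a weighted (non-uniform) timestep sampler.

    mel:     ground-truth mel-spectrogram batch, shape (B, n_mels, T)
    cond:    conditioning information (text encoding, speaker embedding, ...)
    sigmas:  1-D tensor of increasing noise levels, length N+1
    weights: 1-D tensor of length N giving the sampling weight of each timestep
    """
    # Weighted sampler: draw timestep indices n according to `weights`
    # instead of uniformly at random.
    n = torch.multinomial(weights, mel.size(0), replacement=True)

    noise = torch.randn_like(mel)
    x_next = mel + sigmas[n + 1].view(-1, 1, 1) * noise  # noisier point, level sigma_{n+1}
    x_curr = mel + sigmas[n].view(-1, 1, 1) * noise      # adjacent, less noisy point, level sigma_n

    # The online network denoises the noisier sample; the EMA (target) network
    # denoises the adjacent sample. Consistency training pulls the two outputs together.
    pred_online = denoiser(x_next, sigmas[n + 1], cond)
    with torch.no_grad():
        pred_target = ema_denoiser(x_curr, sigmas[n], cond)

    return F.mse_loss(pred_online, pred_target)
```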

Supplementary Experiments

Experiment description for MOS: In this supplementary experiment, we added MOS test results for all models on the VCTK and LJSpeech datasets (Tables 1 and 2). Tables 3 and 4 add MOS results for the CM-TTS and DiffGAN-TTS models in a zero-shot scenario, Table 5 adds MOS results for CM-TTS under different sampler settings, and Table 6 adds MOS results for CM-TTS under different loss settings. The specific results are as follows.
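As a hypothetical illustration of how a MOS mean and an uncertainty interval like the "(±...)" values below can be computed from raw listener ratings (the actual listening-test protocol is described in the paper):

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Return the mean opinion score and a confidence-interval half-width."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    sem = stats.sem(ratings)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(ratings) - 1)
    return mean, half_width

mean, ci = mos_with_ci([4.0, 4.5, 3.5, 4.0, 4.5, 5.0])
print(f"{mean:.4f}(±{ci:.4f})")
```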

Table 1: MOS on the VCTK dataset

| Models | MOS |
| --- | --- |
| Reference (voc.) | 4.5826(±0.1147) |
| FastSpeech2(300K) | 3.6821(±0.1762) |
| VITS | 3.6717(±0.0123) |
| DiffSpeech | 2.9157(±0.0594) |
| DiffGAN-TTS(T=1) | 3.4476(±0.1038) |
| DiffGAN-TTS(T=2) | 3.6173(±0.1433) |
| DiffGAN-TTS(T=4) | 3.6143(±0.1186) |
| CM-TTS(T=1) | 3.9618(±0.0186) |
| CM-TTS(T=2) | 3.8947(±0.0262) |
| CM-TTS(T=4) | 3.8623(±0.0311) |

Table 2: MOS on the LJSpeech dataset

| Models | MOS |
| --- | --- |
| Reference (voc.) | 4.8667(±0.0315) |
| FastSpeech2(300K) | 3.5742(±0.2309) |
| DiffSpeech | 3.1668(±0.1378) |
| CoMoSpeech | 3.5583(±0.2421) |
| VITS | 3.6234(±0.0252) |
| DiffGAN-TTS(T=1) | 3.7142(±0.1390) |
| DiffGAN-TTS(T=2) | 3.6813(±0.0561) |
| DiffGAN-TTS(T=4) | 3.7258(±0.0087) |
| CM-TTS(T=1) | 3.8353(±0.0179) |
| CM-TTS(T=2) | 3.7917(±0.1356) |
| CM-TTS(T=4) | 3.7602(±0.1327) |

Table 3: MOS on VCTK under the zero-shot setting

| Models | MOS |
| --- | --- |
| Reference (voc.) | 4.7467(±0.0194) |
| DiffGAN-TTS(T=1) | 3.4607(±0.1880) |
| DiffGAN-TTS(T=2) | 3.5067(±0.1573) |
| DiffGAN-TTS(T=4) | 3.5893(±0.0298) |
| CM-TTS(T=1) | 3.8715(±0.0896) |
| CM-TTS(T=2) | 3.8387(±0.1521) |
| CM-TTS(T=4) | 3.9221(±0.1016) |

Table 4: MOS on LJSpeech under the zero-shot setting

| Models | MOS |
| --- | --- |
| Reference (voc.) | 4.8832(±0.0174) |
| DiffGAN-TTS(T=1) | 3.6047(±0.1015) |
| DiffGAN-TTS(T=2) | 3.6212(±0.0771) |
| DiffGAN-TTS(T=4) | 3.7361(±0.1802) |
| CM-TTS(T=1) | 3.7205(±0.1097) |
| CM-TTS(T=2) | 3.6817(±0.1328) |
| CM-TTS(T=4) | 3.7113(±0.1022) |

Table 5: MOS on VCTK with different samplers

| Types | MOS |
| --- | --- |
| Reference (voc.) | 4.7172(±0.1236) |
| Uniform | 3.8133(±0.0727) |
| Linear(↗) | 3.3278(±0.0803) |
| Linear(↘) | 3.5676(±0.1488) |
| LSM | 3.9107(±0.1254) |

Table 6: MOS on VCTK with different loss settings

| Types | MOS |
| --- | --- |
| Reference (voc.) | 4.6304(±0.1418) |
| L1 | 3.9052(±0.0415) |
| L1 (w/o padding) | 3.8117(±0.1005) |
| L2 | 3.8726(±0.1971) |
| L2 (w/o padding) | 3.8604(±0.1436) |

Earlier implementations of the FastSpeech 2 model relied on directly importing checkpoints, which may have caused loading errors. We have now retrained the model and re-evaluated the relevant metrics; after careful re-checking, the updated metrics are given in the table below.

Table 7: Updated metrics for FastSpeech2 on VCTK and LJSpeech

| Model (Dataset) | FFE(↓) | Cos-speaker(↑) | mfccFID(↓) | melFID(↓) | mfccRecall(↑) | MCD(↓) | SSIM(↑) | mfccCOS(↑) | F0-RMSE(↓) | WER(↓) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastSpeech2(VCTK) | 0.3503 | 0.8236 | 43.4236 | 8.8175 | 0.3554 | 5.8897 | 0.4537 | 0.7565 | 119.2076 | 0.0677 |
| FastSpeech2(LJSpeech) | 0.4877 | 0.8825 | 36.3090 | 5.2796 | 0.2121 | 6.1157 | 0.6468 | 0.7985 | 135.2583 | 0.0944 |

To verify the individual contributions of CT and LSM to the model's performance, we conducted ablation experiments by separately removing CT and LSM. The experimental results are presented below.

Table 8: Ablation study on VCTK (T=1)

| Models | FFE(↓) | Cos-speaker(↑) | mfccFID(↓) | melFID(↓) | mfccRecall(↑) | MCD(↓) | SSIM(↑) | mfccCOS(↑) | F0-RMSE(↓) | WER(↓) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CM-T1 | 0.3387 | 0.8396 | 39.17 | 7.58 | 0.3946 | 5.91 | 0.4772 | 0.7599 | 119.29 | 0.0688 |
| -CT | 0.3364 | 0.835074 | 43.1316 | 10.74238 | 0.40103 | 5.9821 | 0.4626 | 0.7545 | 122.69101 | 0.0832 |
| -LSM | 0.3351 | 0.8333 | 56.31 | 10.08 | 0.4015 | 5.98 | 0.4396 | 0.7456 | 118.87 | 0.0872 |

To further explore the generalization of LSM, we applied it to DiffGAN-TTS. The experimental results, shown in the following table, demonstrate that LSM brings improvements across most metrics.

Table 9: DiffGAN-TTS with and without LSM

| Model | FFE(↓) | Cos-speaker(↑) | mfccFID(↓) | melFID(↓) | mfccRecall(↑) | MCD(↓) | SSIM(↑) | mfccCOS(↑) | F0-RMSE(↓) | WER(↓) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ground truth | 0.1427 | 0.9424 | 31.9789 | 3.4802 | 0.5644 | 4.567 | 0.8132 | 0.8457 | 89.2136 | 0.0412 |
| DiffGAN-TTS(T=2) | 0.3411 | 0.8333 | 38.6428 | 7.7855 | 0.3974 | 5.9437 | 0.461 | 0.7581 | 117.1919 | 0.0827 |
| +LSM | 0.3397 | 0.8397 | 42.9622 | 7.9161 | 0.399 | 5.8576 | 0.458 | 0.7582 | 115.3769 | 0.072 |
| DiffGAN-TTS(T=4) | 0.3465 | 0.8358 | 37.1099 | 6.5823 | 0.3662 | 5.9425 | 0.4614 | 0.7571 | 120.0975 | 0.0751 |
| +LSM | 0.34054 | 0.8403 | 43.8128 | 7.8876 | 0.387 | 5.8742 | 0.4641 | 0.759 | 115.8887 | 0.0704 |