Official PyTorch implementation of FluentTTS: Text-dependent Fine-grained Style Control for Multi-style TTS. The code builds on the repositories listed in the Acknowledgements below.
Abstract: In this paper, we propose a method to flexibly control the local prosodic variation of a neural text-to-speech (TTS) model. To provide expressiveness for synthesized speech, conventional TTS models utilize utterance-wise global style embeddings that are obtained by compressing frame-level embeddings along the time axis. However, since utterance-wise global features do not contain sufficient information to represent the characteristics of word-level local features, they are not appropriate for direct use on controlling prosody at a fine scale.
In multi-style TTS models, it is very important to have the capability to control local prosody because it plays a key role in finding the most appropriate text-to-speech pair among many one-to-many mapping candidates.
To explicitly present local prosodic characteristics to the contextual information of the corresponding input text, we propose a module to predict the fundamental frequency (F0) at a fine-grained level.
Visit our Demo for audio samples.
- Clone this repository
- Install the Python requirements; please refer to requirements.txt
- As noted in the code reference, modify the return values of torch.nn.functional.multi_head_attention_forward() so that the attention weights of all heads are returned (used to plot attention in the validation step).
```python
# Before
return attn_output, attn_output_weights.sum(dim=1) / num_heads

# After
return attn_output, attn_output_weights
```
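Alternatively, if you are on a recent PyTorch (1.11 or later) and would rather not patch the installed source, per-head weights can usually be obtained by passing average_attn_weights=False where the attention is called; this is a suggestion on our part, not what the official code assumes.

```python
import torch
import torch.nn as nn

# Minimal sketch (PyTorch >= 1.11): average_attn_weights=False returns
# per-head attention weights of shape (batch, num_heads, tgt_len, src_len)
# instead of the head-averaged (batch, tgt_len, src_len) tensor.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
q = k = v = torch.randn(2, 50, 256)
out, attn = mha(q, k, v, average_attn_weights=False)
print(attn.shape)  # torch.Size([2, 4, 50, 50])
```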
- Prepare text preprocessing
1-1. Our code was developed for an internal Korean dataset. If you run it on another language, please modify the files in text/ and hparams.py that are related to symbols and text preprocessing.
1-2. Make data filelists following the format of filelists/example_filelist.txt. They are used for preprocessing and training.
```
/your/data/path/angry_f_1234.wav|your_data_text|speaker_type
/your/data/path/happy_m_5678.wav|your_data_text|speaker_type
/your/data/path/sadness_f_111.wav|your_data_text|speaker_type
...
```
1-3. To determine the number of speakers and emotions and to define the file names used for saving, we rely on the format of filelists/example_filelist.txt. Please modify the data-specific parts (annotated) in utils/data_utils.py, extract_emb.py, mean_i2i.py, and inference.py accordingly (a schematic sketch follows this list).
1-4. As in 1-3., the emotion classification loss is implemented based on this data format. You can use a standard classification loss such as nn.CrossEntropyLoss() instead.
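For reference, the sketch below illustrates the kind of data-specific logic steps 1-2. to 1-4. refer to, assuming filenames follow the {emotion}_{gender}_{id}.wav pattern of filelists/example_filelist.txt; the helper names and the emotion label set are hypothetical, and the actual code lives in the annotated parts of utils/data_utils.py, extract_emb.py, mean_i2i.py, and inference.py.

```python
import os
import torch
import torch.nn as nn

EMOTIONS = ['neutral', 'angry', 'happy', 'sadness']  # example label set (hypothetical)

def parse_filelist_line(line):
    """Split one 'wav_path|text|speaker_type' line of the filelist (step 1-2.)."""
    wav_path, text, speaker = line.strip().split('|')
    return wav_path, text, speaker

def parse_emotion_from_filename(wav_path):
    """e.g. '/your/data/path/angry_f_1234.wav' -> index of 'angry' (step 1-3.)."""
    basename = os.path.basename(wav_path)   # 'angry_f_1234.wav'
    emotion = basename.split('_')[0]        # 'angry'
    return EMOTIONS.index(emotion)

# Step 1-4.: a generic replacement for the emotion classification loss.
emotion_criterion = nn.CrossEntropyLoss()
logits = torch.randn(8, len(EMOTIONS))            # (batch, num_emotions)
targets = torch.randint(0, len(EMOTIONS), (8,))   # ground-truth emotion ids
loss = emotion_criterion(logits, targets)
```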
- Preprocessing
2-1. Before running preprocess.py, modify path (the data path) and file_path (the filelist you made in 1-2.) on lines 21 and 25 (see the sketch after the command below).
2-2. Run
python preprocess.py
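Step 2-1. amounts to pointing two variables at your data; the snippet below only illustrates this, and the exact lines in preprocess.py may differ in your checkout.

```python
# preprocess.py, around lines 21 and 25 (variable names taken from the README;
# adjust to whatever the actual file defines):
path = '/your/data/path'                      # root directory of the wav files
file_path = 'filelists/example_filelist.txt'  # filelist created in step 1-2.
```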
2-3. Modify the data path and the train/validation filelist paths in hparams.py (a placeholder sketch follows the argument list below), then start training:
python train.py -o [SAVE DIRECTORY PATH] -m [BASE OR PROP]
(Arguments)
-c: Ckpt path for loading
-o: Path for saving ckpt and log
-m: Choose baseline or proposed model
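For step 2-3., the fields to edit in hparams.py are the dataset path and the two filelists; the attribute names below are placeholders, so check hparams.py for the real ones.

```python
# Placeholder sketch of the hparams fields referred to in step 2-3.; the real
# attribute names in hparams.py may differ.
class HParams:
    data_path = '/your/data/path'                         # dataset root
    training_files = 'filelists/your_train_filelist.txt'  # training filelist
    validation_files = 'filelists/your_val_filelist.txt'  # validation filelist
```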
- Mean (i2i) style embedding extraction (optional)
0-1. Extract the emotion embeddings of the dataset
python extract_emb.py -o [SAVE DIRECTORY PATH] -c [CHECKPOINT PATH] -m [BASE OR PROP]
(Arguments)
-o: Path for saving emotion embs
-c: Ckpt path for loading
-m: Choose baseline or proposed model
0-2. Compute mean (or I2I) embs
python mean_i2i.py -i [EXTRACTED EMB PATH] -o [SAVE DIRECTORY PATH] -m [NEU OR ALL]
(Arguments)
-i: Path of saved emotion embs
-o: Path for saving mean or i2i embs
-m: Set the farthest emotion as neutral only, or as the other emotions as well (explained in mean_i2i.py)
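Conceptually, the mean option in 0-2. averages the per-utterance emotion embeddings of each emotion class; the sketch below shows this under the assumption that extract_emb.py yields one embedding vector per utterance together with its emotion label. The actual file layout and the i2i (inter-to-intra) variant are repo-specific; see mean_i2i.py.

```python
import numpy as np
from collections import defaultdict

def compute_mean_embeddings(items):
    """items: iterable of (emotion_label, embedding ndarray of shape (D,)).
    Returns one mean embedding per emotion class."""
    buckets = defaultdict(list)
    for emotion, emb in items:
        buckets[emotion].append(emb)
    return {emotion: np.mean(np.stack(embs), axis=0)
            for emotion, embs in buckets.items()}

# Example with random embeddings:
rng = np.random.default_rng(0)
items = [('angry', rng.standard_normal(128)) for _ in range(10)] + \
        [('neutral', rng.standard_normal(128)) for _ in range(10)]
mean_embs = compute_mean_embeddings(items)
print({k: v.shape for k, v in mean_embs.items()})  # {'angry': (128,), 'neutral': (128,)}
```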
- Inference
python inference.py -c [CHECKPOINT PATH] -v [VOCODER PATH] -s [MEAN EMB PATH] -o [SAVE DIRECTORY PATH] -m [BASE OR PROP]
(Arguments)
-c: Ckpt path of acoustic model
-v: Ckpt path of vocoder (HiFi-GAN)
-s (optional): Path of saved mean (i2i) embs
-o: Path for saving generated wavs
-m: Choose baseline or proposed model
--control (optional): F0 control at the utterance or phoneme level
--hz (optional): Value (in Hz) by which to modify F0
--ref_dir (optional): Path of reference wavs; use when you do not apply the mean (i2i) algorithms
--spk (optional): Use with --ref_dir
--emo (optional): Use with --ref_dir
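The --control and --hz options shift F0 at the utterance or phoneme level; the sketch below only illustrates the idea of such a shift, and the actual behavior is implemented in inference.py (which may, for example, modify the model's internal F0 predictions rather than a raw contour).

```python
import torch

def shift_f0(f0, hz, phoneme_mask=None):
    """Illustrative F0 shift.
    f0: (T,) predicted F0 contour in Hz; hz: offset to add.
    phoneme_mask: optional boolean (T,) tensor selecting frames of the phonemes
    to shift (phoneme-level control); if None, the whole utterance is shifted."""
    shifted = f0.clone()
    voiced = f0 > 0  # leave unvoiced frames (F0 == 0) untouched
    target = voiced if phoneme_mask is None else (voiced & phoneme_mask)
    shifted[target] = f0[target] + hz
    return shifted

f0 = torch.tensor([0., 180., 185., 0., 210., 200.])
print(shift_f0(f0, hz=20.0))                                               # utterance-level
print(shift_f0(f0, 20.0, torch.tensor([0, 0, 0, 0, 1, 1], dtype=torch.bool)))  # phoneme-level
```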
We referred to the following repositories for the official implementation.
- NVIDIA/tacotron2: Link
- Deepest-Project/Transformer-TTS: Link
- NVIDIA/FastPitch: Link
- KevinMIN95/StyleSpeech: Link
- Kyubong/g2pK: Link
- jik876/hifi-gan: Link
- KinglittleQ/GST-Tacotron: Link
@article{kim2022fluenttts,
title={FluentTTS: Text-dependent Fine-grained Style Control for Multi-style TTS},
author={Kim, Changhwan and Um, Se-yun and Yoon, Hyungchan and Kang, Hong-Goo},
journal={Proc. Interspeech 2022},
pages={4561--4565},
year={2022}
}