An open-source Kazakh Emotional Text-to-Speech Dataset

KazEmoTTS
⌨️ 😐 😠 🙂 😞 😱 😮 🗣

This repository provides a dataset and a text-to-speech (TTS) model for the paper
KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis

Dataset Statistics 📊

| Emotion | # recordings | F1 Total (h) | F1 Mean (s) | F1 Min (s) | F1 Max (s) | M1 Total (h) | M1 Mean (s) | M1 Min (s) | M1 Max (s) | M2 Total (h) | M2 Mean (s) | M2 Min (s) | M2 Max (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| neutral | 9,385 | 5.85 | 5.03 | 1.03 | 15.51 | 4.54 | 4.77 | 0.84 | 16.18 | 2.30 | 4.69 | 1.02 | 15.81 |
| angry | 9,059 | 5.44 | 4.78 | 1.11 | 14.09 | 4.27 | 4.75 | 0.93 | 17.03 | 2.31 | 4.81 | 1.02 | 15.67 |
| happy | 9,059 | 5.77 | 5.09 | 1.07 | 15.33 | 4.43 | 4.85 | 0.98 | 15.56 | 2.23 | 4.74 | 1.09 | 15.25 |
| sad | 8,980 | 5.60 | 5.04 | 1.11 | 15.21 | 4.62 | 5.13 | 0.72 | 18.00 | 2.65 | 5.52 | 1.16 | 18.16 |
| scared | 9,098 | 5.66 | 4.96 | 1.00 | 15.67 | 4.13 | 4.51 | 0.65 | 16.11 | 2.34 | 4.96 | 1.07 | 14.49 |
| surprised | 9,179 | 5.91 | 5.09 | 1.09 | 14.56 | 4.52 | 4.92 | 0.81 | 17.67 | 2.28 | 4.87 | 1.04 | 15.81 |

| Narrator | # recordings | Duration (h) |
|---|---|---|
| F1 | 24,656 | 34.23 |
| M1 | 19,802 | 26.51 |
| M2 | 10,302 | 14.11 |
| Total | 54,760 | 74.85 |
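
As a quick consistency check, the per-narrator durations and the overall totals can be reproduced by summing the per-emotion rows. The sketch below hard-codes the figures from the statistics above; the variable names are illustrative and not part of the repository.

```python
# Per-emotion figures copied from the dataset statistics:
# emotion -> (number of recordings, hours for F1, hours for M1, hours for M2)
STATS = {
    "neutral":   (9385, 5.85, 4.54, 2.30),
    "angry":     (9059, 5.44, 4.27, 2.31),
    "happy":     (9059, 5.77, 4.43, 2.23),
    "sad":       (8980, 5.60, 4.62, 2.65),
    "scared":    (9098, 5.66, 4.13, 2.34),
    "surprised": (9179, 5.91, 4.52, 2.28),
}

# Total number of recordings across all emotions.
total_recordings = sum(row[0] for row in STATS.values())

# Total hours per narrator, rounded to the two decimals used in the table.
hours = {
    name: round(sum(row[i] for row in STATS.values()), 2)
    for i, name in enumerate(("F1", "M1", "M2"), start=1)
}
total_hours = round(sum(hours.values()), 2)

print(total_recordings)  # 54760
print(hours)             # {'F1': 34.23, 'M1': 26.51, 'M2': 14.11}
print(total_hours)       # 74.85
```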

Installation 🛠️

First, you need to build the monotonic_align code:

cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..

Note: The Python version used is 3.9.13.

Pre-Processing Data for Training 🧹

Download the KazEmoTTS dataset and convert it to the layout used in filelists/all_spk by running data_preparation.py:

python data_preparation.py -d <path_to_KazEmoTTS_dataset>

Training Stage 🏋️‍♂️

To start training, specify the path to the model configuration, which can be found in configs/train_grad.json, and a directory for checkpoints, typically logs/train_logs. Select the GPU to use by setting CUDA_VISIBLE_DEVICES.

export CUDA_VISIBLE_DEVICES=YOUR_GPU_ID
python train_EMA.py -c <configs/train_grad.json> -m <checkpoint>
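
If you prefer to launch training from a Python script, the GPU selection has to be part of the environment of the training process. The following is a minimal sketch under that assumption; `build_train_command` and `run_training` are hypothetical helpers that are not part of this repository, and the value passed to `-m` is assumed to be the checkpoint directory described above.

```python
import os
import subprocess
import sys


def build_train_command(config_path, checkpoint_dir, gpu_id):
    """Return (argv, env) for running train_EMA.py on a single GPU."""
    # Copy the current environment and restrict the visible GPUs for the child.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    argv = [sys.executable, "train_EMA.py", "-c", config_path, "-m", checkpoint_dir]
    return argv, env


def run_training(config_path, checkpoint_dir, gpu_id):
    """Start training; must be called from the repository root."""
    argv, env = build_train_command(config_path, checkpoint_dir, gpu_id)
    subprocess.run(argv, env=env, check=True)


# Build the command without starting it, to inspect what would be run.
argv, env = build_train_command("configs/train_grad.json", "logs/train_logs", gpu_id=0)
print(argv[1:], env["CUDA_VISIBLE_DEVICES"])
```

Calling `run_training("configs/train_grad.json", "logs/train_logs", 0)` would then be equivalent to the shell commands above.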

Inference 🧠

Pre-Training Stage 🏃

To use a pre-trained model, download the necessary checkpoints (TTS, vocoder) for both the GradTTS-based TTS model and the HiFi-GAN vocoder.

To conduct inference, follow these steps:

  • Create a text file containing the sentences you wish to synthesize, such as filelists/inference_generated.txt.
  • Format each line of the text file as follows: text|emotion id|speaker id.
  • Adjust the path to the HiFi-GAN checkpoint in inference_EMA.py.
  • Set the classifier guidance level to 100 using the -g parameter.

python inference_EMA.py -c <config> -m <checkpoint> -t <number-of-timesteps> -g <guidance-level> -f <path-for-text> -r <path-to-save-audios>
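
The text file for the `-f` argument can also be generated programmatically. The sketch below writes and parses lines in the `text|emotion id|speaker id` format; the helper names are hypothetical, the ids are assumed to be integers, and the example ids are placeholders, since the actual mapping of ids to emotions and narrators is defined by the repository.

```python
from pathlib import Path


def format_line(text, emotion_id, speaker_id):
    """Build one 'text|emotion id|speaker id' line."""
    # The pipe is the field separator, so it must not appear in the text itself.
    if "|" in text or "\n" in text:
        raise ValueError("text must not contain '|' or line breaks")
    return f"{text.strip()}|{emotion_id}|{speaker_id}"


def parse_line(line):
    """Split a line back into (text, emotion_id, speaker_id)."""
    text, emotion_id, speaker_id = line.rstrip("\n").split("|")
    return text, int(emotion_id), int(speaker_id)


def write_filelist(path, entries):
    """Write (text, emotion_id, speaker_id) tuples, one per line, as UTF-8."""
    lines = [format_line(*entry) for entry in entries]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Placeholder ids: 0 and 1 are illustrative, not the repository's real mapping.
entries = [("Сәлем, әлем!", 0, 1)]
for entry in entries:
    print(format_line(*entry))
```

Calling `write_filelist("filelists/inference_generated.txt", entries)` from the repository root would produce a file that can be passed to inference_EMA.py via `-f`.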

Synthesized samples 🔈

You can listen to some synthesized samples here.

Citation 🎓

If you use our dataset and/or model in your work, we kindly ask you to cite our paper. Proper citation upholds academic honesty and acknowledges the authors' efforts, and we genuinely appreciate it.

@misc{abilbekov2024kazemotts,
      title={KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis}, 
      author={Adal Abilbekov and Saida Mussakhojayeva and Rustem Yeshpanov and Huseyin Atakan Varol},
      year={2024},
      eprint={2404.01033},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}