HMM-based Emotional Text-to-speech

A repository with comprehensive instructions for using the Festvox toolkit for generating emotional speech from text. This was done as a part of a course project for Speech Recognition and Understanding (ECE557/CSE5SRU) at IIIT Delhi during Winter 2020.

Dataset

TESS

  • No. of speakers: 2 (2 female)
  • Emotions: 7 (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral)
  • No. of utterances: 2800
  • No. of unique prompts: 200
  • Duration: ~2 hours
  • Language: English
  • Comments: Consists of sentences of the form "Say the word ___", where the blank is filled by a unique word. Each speaker has 200 new words spoken after the base sentence, for the 7 different emotions.
  • Pros: Easily available. Emotions contained are very easy to interpret.
  • Cons: Very limited utterances. Same base utterance, leading to redundancy.

EmoV-DB

  • No. of speakers: 5 (3 male, 2 female)
  • Emotions: 5 (neutral, amused, angry, sleepy, disgust)
  • No. of utterances: 6914 (1568, 1315, 1293, 1720, 1018)
  • No. of unique prompts: 1150
  • Duration: ~7 hours
  • Language: English, French (1 male speaker)
  • Comments: The Amused emotion contains non-verbal cues like chuckling, which do not show up in the transcript. Similarly, Sleepiness has yawning sounds.
  • Pros: Only large-scale emotional corpus that we found freely available.
  • Cons: Emotions covered are not very easy to interpret. The non-verbal cues make synthesis difficult. Not all emotions are available for all speakers.

Approach

❌ Approach 1: Using HTS toolkit for building emotional speech

The HTS toolkit is the usual starting point for HMM-based speech synthesis. Much of the work we came across that used HMM techniques to generate speech relied on HTS for its implementation (this paper, this detailed lecture and this beginner's guide were extremely helpful).

Observations

  • Even with the help of the HTS documentation, setting up and using HTS is not easy (which led us to write this README as a more structured guide). The large number of parameters to set makes it overwhelming for a beginner.
  • When we attempted to write the models from scratch, we found that most of the techniques described in the papers above build incrementally on several earlier works, which made them hard to trace and therefore hard to implement.

❌ Approach 2: Using Festvox on the TESS Dataset

The next step was to try the Festvox Toolkit. We tried it on the TESS Dataset as detailed above.

Observation

  • Even though we were able to set up the toolkit, the TESS dataset repeats the same base utterance, "Say the word", followed by a unique word.
  • After "Say the word", the model found it difficult to utter the next word.
  • The models capture different emotions and expressive levels to some degree but fall short on vocabulary, so the next step was to train on a larger emotional corpus with a richer vocabulary, such as EmoV-DB.

✅ Approach 3: Using Festvox on the EmoV-DB

The steps followed are documented in the following flowchart:

Flowchart

The EmoV-DB dataset was converted to the format described in this section. Further details about training from scratch are given here.

Training your own HMM models

The Festvox project is part of the work at Carnegie Mellon University's speech group aimed at advancing the state of speech synthesis.

We will be using Festvox to train our HMM models and build voices.

Requirements

  • Docker
  • Audio Files: The audio files to be used for training.
  • File with utterances: A file that lists the base filename of each audio file together with its transcript. The schema is described below.

Setup

Docker Image

A preconfigured Docker image was created by mjansche for the Text-to-Speech tutorial at SLTU 2016. We will train our HMM models using this Docker image.

The Docker image can be pulled with:

docker pull mjansche/tts-tutorial-sltu2016
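
Once pulled, a container has to be started from the image. The sketch below is one possible way to do that while sharing a host folder with the container. The container name emo-tts, the mount point /data and the use of /bin/bash as the command are assumptions for illustration, not part of the original tutorial setup.

```shell
# Start an interactive container from the tutorial image.
# Assumptions: container name "emo-tts", mount point /data, and /bin/bash
# being available inside the image.
start_tts_container() {
    host_data_dir=$1   # host folder holding your wav files and txt.done.data
    # Set DOCKER=echo to preview the command instead of running it.
    ${DOCKER:-docker} run -it --name emo-tts \
        -v "$host_data_dir:/data" \
        mjansche/tts-tutorial-sltu2016 /bin/bash
}
```

Calling start_tts_container "$PWD/data" would then expose the host folder inside the container. Because the container is named and not removed on exit, it can be re-entered later with docker start -ai emo-tts, keeping the flite build and any trained voices.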

After pulling the Docker image, we need to set up flite, a small, fast, open-source run-time text-to-speech engine. To set up flite, start a container from the image, change to the directory /usr/local/src, and run the following commands:

git clone https://github.com/festvox/flite.git
cd flite
./configure
make

Audio Files

Training requires PCM-encoded, 16-bit, mono wav audio files with a sampling rate of 16 kHz. Use ffmpeg to convert the recorded audio files to the correct format by running the following:

ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav
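
For a whole folder of recordings, the same ffmpeg call can be wrapped in a loop. This is a minimal sketch; the function name convert_dir and the directory arguments are hypothetical, and it assumes every regular file in the source folder is an audio file that ffmpeg can read.

```shell
# Convert every file in src_dir to a 16 kHz, 16-bit, mono PCM wav in dst_dir.
convert_dir() {
    src_dir=$1
    dst_dir=$2
    mkdir -p "$dst_dir"
    for f in "$src_dir"/*; do
        [ -f "$f" ] || continue              # skip sub-directories and empty globs
        base=$(basename "$f")
        out="$dst_dir/${base%.*}.wav"        # same base name, .wav extension
        [ -e "$out" ] && continue            # do not redo finished files
        ffmpeg -i "$f" -acodec pcm_s16le -ac 1 -ar 16000 "$out" \
            || echo "failed: $f" >&2
    done
}
```

For example, convert_dir raw_audio wav_16k would write one converted file per input. The base names are kept unchanged so that they can still be matched against the utterance IDs in the transcript file.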

File with Utterances

For training, you need to create a file named txt.done.data that lists the base filename and the text of every utterance, e.g.

( audio_0001 "a whole joy was reaping." )
( audio_0002 "but they've gone south." )
( audio_0003 "you should fetch azure mike." )

Caution: There must be a space after the opening round brace, a space before the closing round brace, and a space between the file name and the utterance. The utterance must be in double quotes.
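
Since the spacing rules are easy to get wrong, it can help to check the file before training. The following is a rough sketch of such a check; the function name check_txt_done_data and the regular expression are our own illustration of the format shown above, not part of Festvox. Lines with trailing whitespace, Windows line endings or blank lines are reported as malformed as well.

```shell
# Print (with line numbers) every line that does not look like
#   ( <id> "<text>" )
# and return non-zero if any such line exists.
check_txt_done_data() {
    file=$1
    pattern='^\( [^ "()]+ ".+" \)$'
    # grep -v selects the lines that do NOT match the expected format.
    if grep -n -v -E "$pattern" "$file"; then
        return 1
    fi
    return 0
}
```

Running check_txt_done_data txt.done.data prints nothing and succeeds when every line follows the format.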

Training

Preparing the Directory

The first step in training the HMM is to prepare the voice directory. After starting the Docker container, run:

cd /usr/local/src/festvox/src/clustergen
mkdir cmu_us_ss
cd cmu_us_ss
$FESTVOXDIR/src/clustergen/setup_cg cmu us ss

Instead of "cmu" and "ss" you can pick any names you want, but please keep "us" so that Festival knows to use the US English pronunciation dictionary. For Indic voices, use "indic" instead of "us".
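
The setup_cg call above relies on the FESTVOXDIR environment variable, and the flite steps later rely on FLITEDIR. A small guard like the one below can catch a missing variable early. It is only a sketch, and require_env is a hypothetical helper name.

```shell
# Return non-zero (and report on stderr) if any named variable is unset or empty.
require_env() {
    status=0
    for name in "$@"; do
        eval "value=\${$name:-}"     # indirect lookup of the variable called $name
        if [ -z "$value" ]; then
            echo "environment variable $name is not set" >&2
            status=1
        fi
    done
    return "$status"
}
```

A typical call would be require_env FESTVOXDIR FLITEDIR || exit 1 at the top of a build script.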

Synthesis of Audio Files

Assuming that you have already prepared the audio files and the list of utterances,

cp -p WHATEVER/txt.done.data etc/
cp -p WHATEVER/wav/*.wav recording/
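
Before going further, it may be worth confirming that every utterance listed in the transcript file has a recording. The sketch below assumes the txt.done.data layout described earlier, where the utterance ID is the second whitespace-separated field; missing_wavs is a hypothetical helper name. It does not check the reverse case of recordings that have no transcript.

```shell
# Print the utterance IDs from txt.done.data that have no matching wav file.
missing_wavs() {
    data_file=$1
    wav_dir=$2
    # Each line looks like: ( <id> "<text>" ), so the ID is field 2.
    awk '{ print $2 }' "$data_file" | while read -r id; do
        [ -f "$wav_dir/$id.wav" ] || echo "$id"
    done
}
```

For example, missing_wavs etc/txt.done.data recording prints nothing when the two are consistent.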

Since the recordings might not be as good as they could be, you can power-normalize them.

./bin/get_wavs recording/*.wav

Synthesis builds (especially labeling) also work best when there is only a limited amount of leading and trailing silence. The silence can be trimmed with:

./bin/prune_silence wav/*.wav

Note: If you do not require these preprocessing stages, you can put your wav files directly into wav/

Building Voices

For building voices, you can use an automated script that will do the feature extraction, build the models and generate some text examples.

./bin/build_cg_rfs_voice

Manual build

First, build the prompts and label the data.

./bin/do_build build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data
./bin/do_clustergen generate_statename
./bin/do_clustergen generate_filters

Then do feature extraction

./bin/do_clustergen parallel f0_v_sptk
./bin/do_clustergen parallel mcep_sptk
./bin/do_clustergen parallel combine_coeffs_v

Build the models

./bin/traintest etc/txt.done.data
./bin/do_clustergen parallel cluster etc/txt.done.data.train
./bin/do_clustergen dur etc/txt.done.data.train
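
The last two commands read etc/txt.done.data.train, which is produced by the traintest step. To illustrate what such a hold-out split looks like, the sketch below sends every tenth utterance to a .test file and the rest to a .train file. The 1-in-10 ratio is an assumption about the script's behaviour (check ./bin/traintest in your voice directory), and split_train_test is a hypothetical name, not a Festvox command.

```shell
# Illustration only: hold out every 10th line of a transcript file.
# Writes <file>.train and <file>.test next to the input file.
split_train_test() {
    data_file=$1
    awk -v out="$data_file" '
        NR % 10 == 0 { print > (out ".test");  next }   # every 10th utterance
                     { print > (out ".train") }         # everything else
    ' "$data_file"
}
```

With 20 utterances, this leaves 18 lines in the .train file and 2 in the .test file.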

Generating Voices

We will use flite to generate audio from the trained model.

rm -rf flite
$FLITEDIR/tools/setup_flite
./bin/build_flite cg
cd flite
make

flite requires a .flitevox object to build the voices. Create the .flitevox object with:

./flite_cmu_us_${NAME} -voicedump output.flitevox

Audio can then be generated for any utterance with:

./flite_cmu_us_${NAME} "<sentence to utter>" output.wav
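
To synthesize many sentences in one go, the call above can be placed in a loop. This is a sketch under a few assumptions: the voice binary is named after the voice chosen during setup (flite_cmu_us_ss for the earlier example), the input is a hypothetical plain-text file with one sentence per line, and the explicit flite options -t (text) and -o (output file) are used so that single-word inputs are not mistaken for file names.

```shell
# Synthesize one wav file per non-empty line of a sentence file.
synth_all() {
    flite_bin=$1        # e.g. ./flite_cmu_us_ss (assumed voice name "ss")
    sentences=$2        # hypothetical file: one sentence per line
    out_dir=$3
    mkdir -p "$out_dir"
    n=0
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue           # skip blank lines
        n=$((n + 1))
        out="$out_dir/$(printf 'utt_%04d' "$n").wav"
        # stdin is redirected so the binary cannot consume the sentence file.
        "$flite_bin" -t "$line" -o "$out" < /dev/null
    done < "$sentences"
}
```

For example, synth_all ./flite_cmu_us_ss sentences.txt out would write out/utt_0001.wav, out/utt_0002.wav and so on, one per sentence.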

Demonstration

We also make our system demonstration publicly available in the hmm_wrapper directory. Further details are provided in the README of that directory.

Trained Models

We also make the trained models for the different emotions available here.

These models can be used for further fine-tuning or for running the system provided in the hmm_wrapper directory.

References

Festvox : Festvox project developed by Carnegie Mellon University.
Docker : Festvox configured docker image.
Building Data : The format for utterance file.
Training : Steps to train the HMM Model.
Automated Script : Description of the automated script.

Cite

If you find any of the approaches or code in this repository useful, please consider citing this repository:

@software{pranav_jain_2020_3876162,
  author       = {Pranav Jain and
                  Srija Anand and
                  Eshita and
                  Shruti Singh and
                  Aditya Chetan and
                  Brihi Joshi and
                  Pulkit Madaan},
  title        = {{An exploration into HMM-based methods for 
                   Emotional Text-to-Speech}},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v1.0.0},
  doi          = {10.5281/zenodo.3876162},
  url          = {https://doi.org/10.5281/zenodo.3876162}
}

Contact

To report errors or get help with running the project, please open an issue or write to any of the project members:

  • Pranav Jain (pranav16255 [at] iiitd [dot] ac [dot] in)
  • Srija Anand (srija17199 [at] iiitd [dot] ac [dot] in)
  • Eshita (eshita17149 [at] iiitd [dot] ac [dot] in)
  • Shruti Singh (shruti17211 [at] iiitd [dot] ac [dot] in)
  • Pulkit Madaan (pulkit16257 [at] iiitd [dot] ac [dot] in)
  • Aditya Chetan (aditya16217 [at] iiitd [dot] ac [dot] in)
  • Brihi Joshi (brihi16142 [at] iiitd [dot] ac [dot] in)