MFA Runner for Beginners

A simple tool that makes Montreal Forced Aligner easy to use.

Description

As the speech research community grows rapidly, forced alignment between text and audio (wav) has become necessary for research in Text-to-Speech, Voice Conversion, and other speech-related fields. One simple and widely used approach is to use Montreal Forced Aligner (MFA) [McAuliffe17] as the text-wav forced aligner. Although alignment is widely needed, beginners in speech research may find it hard to train MFA on their own custom dataset. For them, this repository offers the following operations and procedures, which are needed to run MFA with little effort.

Convert the whole dataset into the (wav, lab) paired structure that MFA requires
Generate a phoneme dictionary using a pretrained G2P model provided in the official MFA documentation
Train MFA using the formatted (wav, lab) paired dataset and the generated phoneme dictionary
Validate and visualize the extracted TextGrid (alignment) in a Jupyter notebook
Provide the text-wav alignments extracted from the Emotional Speech Dataset (ESD) [Zhou21]
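
The (wav, lab) pairing in the first item can be sketched as follows. MFA expects each audio file to sit next to a transcript file with the same base name, grouped in one folder per speaker. The function name write_wav_lab_pair and its arguments are hypothetical and are not taken from this repository's main.py; this is a minimal illustration, not the actual formatter.

```python
import shutil
from pathlib import Path


def write_wav_lab_pair(wav_path, text, speaker, corpus_dir):
    """Copy one wav into corpus_dir/speaker/ and write a same-named .lab transcript.

    All names here are illustrative assumptions, not this repository's API.
    """
    wav_path = Path(wav_path)
    speaker_dir = Path(corpus_dir) / speaker
    speaker_dir.mkdir(parents=True, exist_ok=True)

    # Keep the original file name so the wav and lab share one base name.
    target_wav = speaker_dir / wav_path.name
    shutil.copy2(wav_path, target_wav)

    # The .lab file holds the plain-text transcript of the utterance.
    target_lab = target_wav.with_suffix(".lab")
    target_lab.write_text(text.strip() + "\n", encoding="utf-8")
    return target_wav, target_lab
```

Calling this once per utterance should yield a corpus folder in which every speaker directory contains matching .wav and .lab files.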

How to use

To run this program, please follow the procedure below.

Install Anaconda and Python 3.9.
Install MFA and download the ESD dataset.
Install the prerequisite modules with pip using the following command: pip install -r requirements.txt
Edit config.py to point to your dataset.
Run the formatter: python main.py
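
Before running the formatter, it may help to confirm that the first two steps succeeded. The sketch below is a hypothetical pre-flight check and is not part of this repository. It assumes that MFA installs an executable named mfa on your PATH and that any Python version older than 3.9 should be reported.

```python
import shutil
import sys


def check_environment(which=shutil.which, version=sys.version_info):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    # Assumption: versions older than 3.9 are flagged, newer ones are accepted.
    if tuple(version[:2]) < (3, 9):
        problems.append("Python 3.9 or newer is expected")
    # Assumption: the MFA command-line tool is available as 'mfa'.
    if which("mfa") is None:
        problems.append("the 'mfa' executable was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The which and version parameters exist only so the check can be exercised without changing the real environment.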

Alignments

As a result of this tutorial, I have uploaded the text-wav alignments extracted using MFA.

Emotional Speech Dataset (ESD) [Zhou21] [download]

Visualization of Extracted Alignments

Please refer to visualise_alignment.ipynb.
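
For a quick look at an alignment outside the notebook, the intervals can be read with a few lines of Python. This is a minimal sketch that assumes the TextGrid is in Praat's long text format; it is not the code used in the notebook, and the name read_intervals is hypothetical. It flattens all tiers (for example words and phones) into one list, so a dedicated TextGrid library is a better choice for real analysis.

```python
import re

# Matches one interval entry of a long-format TextGrid:
#   intervals [k]:  xmin = ...  xmax = ...  text = "..."
# A doubled quote ("") inside the text is Praat's escape for a literal quote.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([0-9.eE+-]+)\s*'
    r'xmax = ([0-9.eE+-]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)


def read_intervals(textgrid_text):
    """Return (start, end, label) tuples for every interval in every tier."""
    return [
        (float(start), float(end), label.replace('""', '"'))
        for start, end, label in INTERVAL_RE.findall(textgrid_text)
    ]
```

Intervals with an empty label usually correspond to silence and can be filtered out before plotting.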

Supported Dataset

Emotional Speech Dataset (ESD) [Zhou21] - an English multi-speaker, multi-emotion dataset
Korean Single Speaker Dataset (KSS) - a Korean single female speaker dataset
감정 음성합성 데이터셋 (Korean Emotional Speech Dataset) - a Korean single-speaker, multi-emotion dataset
EmotionTTS OpenDB - a multipurpose dataset (in this repository, only the multi-speaker, multi-emotion subset is considered)

Experimental Notes

Currently, only ESD is supported.
Different emotions from a single speaker are handled independently (i.e., utterances with the emotion 'Angry' and utterances with the emotion 'Sad' from the same speaker are treated as coming from different speakers). This is a simple "remedy" to reduce the complexity of the style (emotion) distribution.
Please note that the extracted alignments may not be accurate.
For the ESD dataset, only the English speakers are used.
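
The speaker-splitting note above can be illustrated with a short sketch. It assumes each utterance is described by an ID, a speaker ID, and an emotion label. The function names and the "speaker_emotion" naming scheme are hypothetical and may differ from what this repository actually writes to disk.

```python
from collections import defaultdict


def pseudo_speaker(speaker, emotion):
    """Combine speaker and emotion into one ID so each emotion acts as its own speaker."""
    return f"{speaker}_{emotion.strip().lower()}"


def group_by_pseudo_speaker(utterances):
    """Group (utterance_id, speaker, emotion) tuples by their pseudo-speaker ID."""
    groups = defaultdict(list)
    for utterance_id, speaker, emotion in utterances:
        groups[pseudo_speaker(speaker, emotion)].append(utterance_id)
    return dict(groups)
```

With this scheme, 'Angry' and 'Sad' utterances from the same speaker end up in two separate groups, which MFA would then treat as two speakers.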

Contacts

Please email mskang1478@gmail.com. Any suggestions or questions are appreciated. I hope this repository is helpful.

ericbang/MFARunner