Implementation of "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf) paper.

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

DurIAN

Status: released

1 Info

DurIAN is an encoder-decoder architecture for text-to-speech synthesis. Unlike earlier architectures such as Tacotron 2, it does not learn an attention mechanism; instead, it relies on phoneme duration information. To use this model you therefore need a phonemized, duration-aligned dataset. Alternatively, you may try the duration model pretrained on the LJSpeech dataset (phonemized with the CMU dictionary). Links are provided below.
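
To illustrate how durations can replace attention, here is a minimal, framework-free sketch of duration-based state expansion: each phoneme-level encoder state is repeated for as many frames as the phoneme lasts. The function name `expand_by_duration` and the toy values are hypothetical and not taken from this repository, which works on tensors and may differ in detail.

```python
from typing import List, Sequence


def expand_by_duration(
    states: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each phoneme-level state durations[i] times.

    states:    one feature vector per phoneme (hypothetical toy values).
    durations: number of spectrogram frames each phoneme spans.
    Returns one feature vector per output frame.
    """
    if len(states) != len(durations):
        raise ValueError("exactly one duration per phoneme state is required")
    frames: List[List[float]] = []
    for state, n_frames in zip(states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state once per frame so frames do not share one list object.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Three phonemes lasting 2, 1 and 3 frames give 6 frame-level states.
frames = expand_by_duration([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [2, 1, 3])
print(len(frames))
```

The decoder can then consume the expanded sequence frame by frame, so the text-to-frame alignment is fixed by the durations instead of being learned.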

2 Architecture details

The DurIAN model consists of two modules: a backbone synthesizer and a duration predictor. The most notable differences from the DurIAN described in the paper are:

  • Prosodic boundary markers aren't used (I didn't have them labeled), so there is no 'skip states' exclusion of the hidden states at prosodic boundaries
  • Style codes aren't used either (for the same reason)
  • The Prenet before the CBHG encoder was removed (it didn't improve accuracy in experiments)
  • The decoder's recurrent cell outputs a single spectrogram frame at a time

The backbone synthesizer and the duration model are trained simultaneously. To simplify the implementation, the duration model predicts an alignment over a fixed maximum number of frames. You can train these outputs as a BCE problem, as an MSE problem (by summing over the frames axis), or with both losses combined (I haven't tested the combination); set this in config.json. In experiments, BCE-only optimization was unstable on longer text sequences, so prefer MSE+BCE or MSE only (don't worry if the alignments shown in TensorBoard look bad).
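
The two duration losses can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the repository's code: the target layout (row i holds ones in its first duration_i slots), the helper names, and the toy predictions are all hypothetical, and the actual implementation operates on PyTorch tensors.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def duration_targets(durations: Sequence[int], max_frames: int) -> List[List[float]]:
    """Binary targets of shape [n_phonemes, max_frames] (assumed layout)."""
    return [[1.0 if j < d else 0.0 for j in range(max_frames)] for d in durations]


def bce_loss(pred: Matrix, target: Matrix, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over every (phoneme, frame) cell."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def mse_duration_loss(pred: Matrix, target: Matrix) -> float:
    """MSE between per-phoneme durations obtained by summing over the frames axis."""
    errors = [(sum(p_row) - sum(t_row)) ** 2 for p_row, t_row in zip(pred, target)]
    return sum(errors) / len(errors)


target = duration_targets([2, 1], max_frames=3)   # two phonemes: 2 and 1 frames
pred = [[0.9, 0.8, 0.1], [0.7, 0.2, 0.1]]          # toy sigmoid outputs
combined = bce_loss(pred, target) + mse_duration_loss(pred, target)
print(combined)
```

Note that the MSE term only constrains the row sums, so it can be small even when individual cells are far from the binary target.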

3 Reproducibility

You can listen to a synthesis demo wav file in the demo folder. It was obtained long before convergence, using the WaveGlow vocoder.

  1. Clone the repository: git clone https://github.com/ivanvovk/DurrIAN

  2. Install all required packages: pip install --upgrade -r requirements.txt. The code was tested with pytorch==1.5.0

  3. To train the paper-based DurIAN version, run python train.py -c configs/default.json. To train the baseline model instead, run python train.py -c configs/baseline.json --baseline

To make sure everything works in your local environment, you can run the unit tests in the tests folder with python <test_you_want_to_run.py>.

4 Pretrained models

This implementation was trained on the phonemized, duration-aligned LJSpeech dataset, minimizing the BCE duration loss. You may find the checkpoint via this link.

5 Dataset alignment problem

The main drawback of this model is that it requires a duration-aligned dataset. The parsed LJSpeech filelists used to train the current implementation are in the filelists folder. To use your own data, organize your filelists in the same way as the provided LJSpeech ones. To save time and effort, you may instead try training on your dataset without duration alignment, using the duration model pretrained on LJSpeech from my checkpoint (I haven't tried this). If you do want to align your own dataset, carefully follow the next section.

6 How to align your own data

In my experiments I aligned LJSpeech with the Montreal Forced Aligner (MFA). If anything here is unclear, please follow the instructions in the toolkit's docs. The alignment procedure has several steps:

  1. Organize your dataset properly. MFA requires a single folder with the structure {utterance_id.lab, utterance_id.wav}. Make sure all your transcripts are in .lab format.

  2. Download an MFA release and follow the installation instructions via this link.

  3. Once MFA is installed, you need a pronunciation dictionary that maps the words of your dataset to phoneme transcriptions. You have several options:

    1. (Try this first) Download a ready-made dictionary from the MFA pretrained models list (at the bottom of the page). The current implementation uses the English Arpabet dictionary. One possible problem: if your dataset contains words missing from the dictionary, MFA may fail to parse them and skip the corresponding files. You may accept skipping them, preprocess your dataset to match the dictionary, or add the missing words by hand (if there are not too many).
    2. Generate the dictionary with a pretrained G2P model from the MFA pretrained models list, using the command bin/mfa_generate_dictionary /path/to/model_g2p.zip /path/to/data dict.txt. Note that the default MFA installation already includes a pretrained English model, which you may use.
    3. Otherwise, you'll need to train your own G2P model on your data. To do so, follow the instructions via this link.
  4. Once you have your data, dictionary and G2P model prepared, you are ready to align. Run the command bin/mfa_align /path/to/data dict.txt path/to/model_g2p.zip outdir and wait until it finishes. The outdir folder will contain a list of out-of-vocabulary words and a folder of .TextGrid files, which store the alignments of the wavs.

  5. Next, process these TextGrid files to build the final filelist. The Python package TextGrid is useful here; install it with pip install TextGrid. Here is an example of how to use it:

    import textgrid
    tg = textgrid.TextGrid.fromFile('./outdir/data/text0.TextGrid')
    

    Now tg holds two tiers: the first contains the aligned words, the second the aligned phonemes. You need the second one. Extract the durations for the whole dataset by iterating over the .TextGrid files (in frames! tg stores intervals in seconds, so convert them), and prepare a filelist in the same format as the ones provided in the filelists folder.
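
The seconds-to-frames conversion from step 5 could look like the sketch below. It takes plain (label, start, end) tuples so that it runs without any TextGrid file. The function name is hypothetical, and the defaults of 22050 Hz and a hop length of 256 samples are assumptions; use the values from your own spectrogram configuration. Rounding the cumulative boundaries, rather than each interval separately, keeps the durations summing to the utterance length in frames.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, float, float]  # (phoneme label, start in s, end in s)


def intervals_to_frame_durations(
    intervals: Iterable[Interval],
    sample_rate: int = 22050,  # assumed; match your audio
    hop_length: int = 256,     # assumed; match your spectrogram settings
) -> List[Tuple[str, int]]:
    """Convert contiguous phoneme intervals in seconds to frame counts."""
    frames_per_second = sample_rate / hop_length
    durations: List[Tuple[str, int]] = []
    prev_boundary = None
    for label, start, end in intervals:
        if prev_boundary is None:
            prev_boundary = round(start * frames_per_second)
        # Round the cumulative end boundary so rounding errors do not pile up.
        boundary = round(end * frames_per_second)
        durations.append((label, boundary - prev_boundary))
        prev_boundary = boundary
    return durations


# With the textgrid package, the tuples could be collected like this
# (assuming the phoneme tier is at index 1, as described above):
#   phones = [(iv.mark, iv.minTime, iv.maxTime) for iv in tg.tiers[1].intervals]
phones = [("HH", 0.0, 0.1), ("AH", 0.1, 0.25), ("", 0.25, 0.3)]
print(intervals_to_frame_durations(phones, sample_rate=16000, hop_length=200))
```

Intervals with an empty label typically mark silence in the aligner output; whether to keep, merge, or drop them depends on how your filelists represent pauses.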

I found an overview of several aligners that may be helpful. However, I recommend MFA, as it is one of the most accurate aligners to the best of my knowledge.
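
Before running the aligner, it may also help to check that the corpus folder really follows the {utterance_id.lab, utterance_id.wav} layout from step 1. The helper below is a small sketch using only the standard library; its name and the example path are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def find_unpaired_utterances(corpus_dir: str) -> Tuple[List[str], List[str]]:
    """Return (wavs without a .lab, labs without a .wav) as sorted utterance ids."""
    corpus = Path(corpus_dir)
    wav_ids = {path.stem for path in corpus.glob("*.wav")}
    lab_ids = {path.stem for path in corpus.glob("*.lab")}
    return sorted(wav_ids - lab_ids), sorted(lab_ids - wav_ids)


if __name__ == "__main__":
    # "/path/to/data" is a placeholder for your corpus folder.
    missing_labs, missing_wavs = find_unpaired_utterances("/path/to/data")
    print("wav files without a transcript:", missing_labs)
    print("transcripts without a wav file:", missing_wavs)
```

Both lists should be empty before alignment; any utterance listed here has no complete audio and transcript pair to align.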