
Speech synthesis (TTS) in low-resource languages by training from scratch with Fastpitch and fine-tuning with HifiGan

Turkish Text-to-Speech


This repository contains a Dockerfile that extends the PyTorch 21.02-py3 NGC container and encapsulates some dependencies. To create your own container, choose a PyTorch container from NVIDIA PyTorch Container Versions and create a Dockerfile in the following format:

FROM nvcr.io/nvidia/pytorch:21.02-py3
WORKDIR /path/to/working/directory/text2speech/
COPY requirements.txt .
RUN pip install -r requirements.txt
  1. Build and run docker

Go to the /path/to/working/directory/text2speech/docker directory:

$ docker build --no-cache -t torcht2s .
$ docker run -it --rm --gpus all -p 2222:8888 -v /path/to/working/directory/text2speech:/path/to/working/directory/text2speech torcht2s
  1. Register the environment as a Jupyter kernel and launch Jupyter Notebook
$ python -m ipykernel install --user --name=torcht2s
$ jupyter notebook --ip= --port=8888 --no-browser --allow-root
  1. Open a browser on your local machine, navigate to${TOKEN}, and enter the token shown in your terminal.

Text Preprocessing (Phonetical Conversion and Normalization for Turkish)

Training a speech synthesis model requires audio recordings paired with the symbol sequences that represent them. That is why the first step encodes the input text into a list of symbols. In this study, we use Turkish characters and phonemes as the symbols. Since Turkish is a phonetic language, words are pronounced as they are written, so the character sequence of a word already describes its pronunciation. In non-phonetic languages such as English, pronunciation has to be expressed with phonemes. To synthesize Turkish speech with English data, the words in the English dataset must first be phonetically transcribed into Turkish.
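As a rough illustration of the encoding step described above, the sketch below maps lowercase Turkish text to integer symbol IDs. The symbol table, the padding symbol, and the function names are assumptions made for this example and are not taken from the repository's code.

```python
# Hypothetical symbol table: a padding symbol, the 29 letters of the Turkish
# alphabet, and a few punctuation marks. The repository's table may differ.
PAD = "_"
LETTERS = "abcçdefgğhıijklmnoöprsştuüvyz"
PUNCTUATION = " .,!?"
SYMBOLS = [PAD] + list(LETTERS) + list(PUNCTUATION)

SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}
ID_TO_SYMBOL = {index: symbol for symbol, index in SYMBOL_TO_ID.items()}


def text_to_sequence(text):
    """Map each known character to its integer ID; unknown characters are skipped."""
    return [SYMBOL_TO_ID[ch] for ch in text if ch in SYMBOL_TO_ID]


def sequence_to_text(ids):
    """Inverse mapping, useful for checking what the model actually receives."""
    return "".join(ID_TO_SYMBOL[i] for i in ids)


if __name__ == "__main__":
    ids = text_to_sequence("merhaba dünya")
    print(ids)
    print(sequence_to_text(ids))
```

The sketch expects text that is already lowercased and normalized. Because unknown characters are silently dropped here, running normalization first matters.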

  • In this study, cmudict_tr and heteronyms_tr were used. CMUDict (Turkish phonetic lexicon) is a dictionary that phonetically transcribes about 1.5M Turkish words.
  • The following symbols are the phonemes used to represent Turkish pronunciation.
valid_symbols = ['1', '1:', '2', '2:', '5', 'a', 'a:', 'b', 'c', 'd', 'dZ', 'e', 'e:', 'f', 'g', 'gj', 'h', 'i', 'i:', 'j',
  'k', 'l', 'm', 'n', 'N', 'o', 'o:', 'p', 'r', 's', 'S', 't', 'tS', 'u', 'u:', 'v', 'y', 'y:', 'z', 'Z']
  • Text normalization converts text from written form into its verbalized form, and it is an essential preprocessing step before text-to-speech synthesis. It ensures that TTS can handle all input texts without skipping unknown symbols. Text normalization is applied for Turkish utterances.
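To make the normalization step more concrete, here is a minimal sketch covering two small cases: Turkish-aware lowercasing and verbalizing standalone single digits. It is an illustration under stated assumptions, not the normalizer used in this repository, and the function names are hypothetical. A real normalizer would also handle multi-digit numbers, dates, abbreviations, and so on.

```python
import re

# Turkish words for the digits 0-9.
DIGIT_WORDS = {
    "0": "sıfır", "1": "bir", "2": "iki", "3": "üç", "4": "dört",
    "5": "beş", "6": "altı", "7": "yedi", "8": "sekiz", "9": "dokuz",
}


def turkish_lower(text):
    """Lowercase with Turkish casing rules: 'I' pairs with 'ı', 'İ' with 'i'."""
    # Python's str.lower() would turn "I" into "i", which is wrong for Turkish,
    # so the two special letters are replaced before calling lower().
    return text.replace("I", "ı").replace("İ", "i").lower()


def normalize(text):
    """Lowercase, verbalize standalone single digits, collapse whitespace."""
    text = turkish_lower(text)
    # Only a digit with no digit neighbours is replaced; "12" stays as it is.
    text = re.sub(r"(?<![0-9])[0-9](?![0-9])",
                  lambda match: DIGIT_WORDS[match.group()], text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize("IŞIK  3 Kişi"))
```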

Data Preparation

To speed up training, the mel-spectrograms and pitch values can be generated once during pre-processing and then read directly from disk during training. Follow these steps to use a custom dataset.

  1. Prepare a directory with .wav files and filelists (training/validation splits of the data) containing transcripts and paths to the .wav files, under the text2speech/Fastpitch/dataset/ location. Each filelist should list a single utterance per line as:
<audio file path>|<transcript>
  1. Calculate pitch and mels by running the pre-processing script from text2speech/Fastpitch/data_preperation.ipynb:
$ python prepare_dataset.py \
    --wav-text-filelists dataset/tts_data.txt \
    --n-workers 16 \
    --batch-size 1 \
    --dataset-path dataset \
    --extract-pitch \
    --f0-method pyin \
    --extract-mels
  1. Prepare filelists with paths to the pre-calculated pitch by running create_picth_text_file(manifest_path) from text2speech/Fastpitch/data_preperation.ipynb. Each filelist should list a single utterance per line as:
<mel or wav file path>|<pitch file path>|<text>|<speaker_id>
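The filelist formats above can be checked and split with a few lines of Python. The sketch below is only an illustration: `parse_line` and `split_filelist` are hypothetical helpers, not functions from this repository, and the 10% validation fraction is an arbitrary assumption.

```python
import random


def parse_line(line, n_fields):
    """Split one filelist line on '|' and verify the number of fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(fields)}: {line!r}")
    return fields


def split_filelist(lines, val_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train_lines, val_lines)."""
    lines = [line for line in lines if line.strip()]  # drop empty lines
    random.Random(seed).shuffle(lines)
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]


if __name__ == "__main__":
    # Two-field format: <audio file path>|<transcript>
    utterances = [f"wavs/{i}.wav|örnek cümle {i}" for i in range(20)]
    train, val = split_filelist(utterances)
    print(len(train), len(val))
    # Four-field format: <mel or wav file path>|<pitch file path>|<text>|<speaker_id>
    print(parse_line("mels/0.pt|pitch/0.pt|örnek cümle|0", 4))
```

Note that a transcript containing the `|` character would break this format, so such characters should be removed during text normalization.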

The complete dataset has the following structure:

├── mels
├── pitch
├── wavs
├── tts_data.txt  # train + val
├── tts_data_train.txt
├── tts_data_val.txt
├── tts_pitch_data.txt  # train + val
├── tts_pitch_data_train.txt
├── tts_pitch_data_val.txt
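Before starting a long training run, it can help to confirm that the dataset directory matches the layout above. The following is a small sketch of such a check; `missing_entries` is a hypothetical helper written for this example, and the default path `dataset` is an assumption based on the commands in this README.

```python
from pathlib import Path

# Directory and file names taken from the dataset layout shown above.
EXPECTED_DIRS = ["mels", "pitch", "wavs"]
EXPECTED_FILES = [
    "tts_data.txt",
    "tts_data_train.txt",
    "tts_data_val.txt",
    "tts_pitch_data.txt",
    "tts_pitch_data_train.txt",
    "tts_pitch_data_val.txt",
]


def missing_entries(dataset_root):
    """Return the names of expected directories and files that do not exist."""
    root = Path(dataset_root)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


if __name__ == "__main__":
    problems = missing_entries("dataset")
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("Dataset layout looks complete.")
```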

Training Fastpitch from scratch (Spectrogram Generator)

The training will produce a FastPitch model capable of generating mel-spectrograms from raw text. It will be serialized as a single .pt checkpoint file, along with a series of intermediate checkpoints.

$ python train.py --cuda --amp --p-arpabet 1.0 --dataset-path dataset \
                --output saved_fastpicth_models/ \
                --training-files dataset/tts_pitch_data_train.txt \
                --validation-files dataset/tts_pitch_data_val.txt \
                --epochs 1000 --learning-rate 0.001 --batch-size 32

Fine-tuning the model with HiFi-GAN

The last step converts the spectrogram into a waveform. A model that generates speech from a spectrogram is called a vocoder.

Some mel-spectrogram generators are prone to model bias. As the spectrograms differ from the true data on which HiFi-GAN was trained, the quality of the generated audio might suffer. In order to overcome this problem, a HiFi-GAN model can be fine-tuned on the outputs of a particular mel-spectrogram generator in order to adapt to this bias. In this section we will perform fine-tuning to FastPitch outputs.

  1. Generate mel-spectrograms for all utterances in the dataset with the FastPitch model
  • Copy the best-performing FastPitch checkpoint (.pt file) into the text2speech/Hifigan/data/pretrained_fastpicth_model/ directory.
  • Copy the manifest file tts_pitch_data.txt (train + val) into the text2speech/Hifigan/data/ directory.
$ python extract_mels.py --cuda \
    -o data/mels-fastpitch-tr22khz \
    --dataset-path /text2speech/Fastpitch/dataset \
    --dataset-files data/tts_pitch_data.txt \
    --load-pitch-from-disk \
    --checkpoint-path data/pretrained_fastpicth_model/FastPitch_checkpoint.pt -bs 16

Mel-spectrograms should now be prepared in the text2speech/Hifigan/data/mels-fastpitch-tr22khz directory. The fine-tuning script will load an existing HiFi-GAN model and run several epochs of training using spectrograms generated in the last step.

  1. Fine-tune HiFi-GAN on the FastPitch outputs

This step will produce another .pt HiFi-GAN model checkpoint file fine-tuned to the particular FastPitch model.

  • Create a new folder named results in the text2speech/Hifigan directory.
$ nohup python train.py --cuda --output /results/hifigan_tr22khz \
 --epochs 1000 --dataset_path /Fastpitch/dataset \
 --input_mels_dir /data/mels-fastpitch-tr22khz \
 --training_files /Fastpitch/dataset/tts_data.txt \
 --validation_files /Fastpitch/dataset/tts_data.txt \
 --fine_tuning --fine_tune_lr_factor 3 --batch_size 16 \
 --learning_rate 0.0003 --lr_decay 0.9998 --validation_interval 10 > log.txt
  1. Open another terminal and follow the log as follows:
$ tail -f log.txt 


Run the following command to synthesize audio from raw text with the mel-spectrogram generator and the fine-tuned vocoder:

$ python inference.py --cuda \
  --hifigan /Hifigan/results/hifigan_tr22khz/hifigan_gen_checkpoint.pt \
  --fastpitch /Fastpitch/saved_fastpicth_models/FastPitch_checkpoint.pt \
  -i test_text.txt \
  -o wavs/

The speech is generated from a file passed with the -i argument. The output audio will be stored in the path specified by the -o argument.