improve tts
lethanhson9901 opened this issue · 17 comments
Hi, me again.
I'm training your TTS. My dataset is about 16 hours.
First, because my dataset's utterances are similar to yours, I trained the acoustic model using two approaches:
- Continuing from your acoustic checkpoint to 1.46M steps: val loss 0.227, and it looks close to convergence.
Full details here: https://drive.google.com/drive/folders/1j0OT7KgJOk5hmcOVNPdcdkaekRRxHekk?usp=sharing
Second, I trained the HiFi-GAN vocoder (with the 1.46M acoustic model) for about 290k steps.
My transcript text: "xin chào tôi là phương anh bản thử số chín"
- I got this: https://drive.google.com/file/d/1UtgE1gTC8mwo1SV1b7chauvWPC7uPjxM/view?usp=sharing
  => The speaker talks nonsense, but the intonation is quite good.
- Here is 50k vocoder + 1.46M acoustic, just to compare:
  https://drive.google.com/file/d/1InQ8ykYC_P7qaKhv_58SmTC0r-b_4_0h/view?usp=sharing
- And 50k vocoder + 800k acoustic trained from scratch: https://drive.google.com/file/d/1E-FjOfBqFf9vHTKXmAUhamtB2FsAlAMT/view?usp=sharing
I'm stuck. Should I focus on the acoustic model, the vocoder, or the dataset to improve the result?
Thanks!
It seems to me that you are using the wrong lexicon file when generating speech.
The default scripts use the lexicon file assets/infore/lexicon.txt
from the InfoRE dataset; when working with your own dataset, you should replace it with your own lexicon file.
I don't think so; the lexicon file just maps words to characters. I also recorded audio with the same text as the original dataset.
In the function,
vietTTS/vietTTS/nat/data_loader.py
Line 11 in 346d467
we use the lexicon file to compute the phoneme set, and use that set to compute phoneme indices. A mismatched phoneme set between training and inference will cause problems.
We use this function at inference:
vietTTS/vietTTS/nat/text2mel.py
Line 33 in 346d467
and at training:
vietTTS/vietTTS/nat/data_loader.py
Line 56 in 346d467
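To illustrate why this matters, here is a minimal sketch (not the actual vietTTS code; the lexicon lines and helper name are hypothetical) of how a phoneme-to-index map built from a lexicon can silently shift when the lexicon changes:

```python
def phoneme_index(lexicon_lines):
    """Build a phoneme -> index map from 'word<TAB>p h o n e m e s' lines."""
    phonemes = set()
    for line in lexicon_lines:
        word, pron = line.split("\t")
        phonemes.update(pron.split())
    # Indices are assigned by sorted order, so they depend on the whole set.
    return {p: i for i, p in enumerate(sorted(phonemes))}

# Two different lexicons (hypothetical toy entries):
train_lexicon = ["xin\tx i n", "chào\tch a o"]
infer_lexicon = ["xin\tx i n", "tôi\tt ô i"]

idx_train = phoneme_index(train_lexicon)
idx_infer = phoneme_index(infer_lexicon)

# The same phoneme gets a different index in each set, so the acoustic
# model would receive wrong input IDs at inference time.
print(idx_train["x"], idx_infer["x"])
```

This is why the training lexicon and the inference lexicon must be the same file: even if both contain a given phoneme, its numeric index depends on every other entry in the set.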
So do I have to train again, or just replace my lexicon file?
My advice is to use the lexicon file that was used to train your model. Usually, it is at train_data/lexicon.txt
Generate speech with:

```sh
python3 -m vietTTS.synthesizer \
  --lexicon-file=train_data/lexicon.txt \
  --text="hôm qua em tới trường" \
  --output=clip.wav
```
I'm confused, because during training I used your lexicon file (because my audio matches the original). I already changed it as you instructed, but the result is the same.
@Lethanhson9901, can you show a few lines in your train_data/lexicon.txt
and an example *.textgrid file?
I suspect that there is a mismatch somewhere, as your loss and mel-spectrogram look alright to me.
@Lethanhson9901 Everything seems alright to me. I don't think I can help much in this case.
@Lethanhson9901
My advice:
- Train the duration model (10 minutes) and acoustic model (50 minutes) on the InfoRE dataset and generate speech to make sure everything is working correctly (even though the speech won't be good, it should be understandable).
- Train the duration model (10 minutes) and acoustic model (50 minutes) on your dataset and generate speech (using your pretrained HiFi-GAN). The speech should be understandable.
I'll try. Many thanks!
Hey, thank God, you're right. It works!
By the way, if I want the voice to be more natural, which part of training should I focus on? Or should I focus on data pre-processing (denoising, etc.)? And should I use audio augmentation in training? (So far I don't think it helps.)
@Lethanhson9901
I'm not sure data augmentation can help. There are a few things that can help:
- A better dataset:
  - Better voice recordings: clear voice, correct pronunciation.
  - Higher sample rate: 16k -> 22k -> 24k -> 48k.
  - More data: 20 hours -> 40 hours -> 100 hours.
  - Better text transcripts: cover all Vietnamese phonemes with approximately equal phoneme frequencies.
  - Clean the dataset: make sure text and speech are perfectly matched.
- A bigger model:
  - If you have more data, use decoder RNNs with 1024 units (currently, it is 512 on the InfoRE dataset).
- Train the acoustic model and the HiFi-GAN model longer.
- Fine-tune the HiFi-GAN model with your acoustic model.
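As a rough sense of what "use decoder RNNs with 1024 units" costs, here is a back-of-the-envelope parameter count for a single GRU layer (this is illustrative arithmetic, not the actual vietTTS model definition):

```python
def gru_params(input_size, hidden_size):
    """Parameter count of one GRU layer: 3 gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 3 * (input_size * hidden_size
                + hidden_size * hidden_size
                + hidden_size)

small = gru_params(512, 512)    # hidden size used on InfoRE
big = gru_params(512, 1024)     # suggested size for larger datasets
print(small, big)
# Doubling the hidden size roughly triples this layer's parameters
# (the recurrent h*h term alone quadruples), which is why more data
# is recommended before growing the decoder.
```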
Hi @NTT123,
Could you please give me instructions on how to change the decoder RNNs to 1024 units for my dataset?
@nampdn You will need to modify the file vietTTS/nat/config.py
by setting acoustic_decoder_dim=1024
Fantastic, thank you!