improve tts
lethanhson9901 opened this issue · 17 comments
Hi, me again.
I'm training your TTS. My dataset is about 16 hours.
First, because my dataset's utterances are similar to yours, I trained the acoustic model using two approaches:
- Continuing from your acoustic checkpoint to 1.46M steps: val loss 0.227, and it looks close to convergence.
Full details here: https://drive.google.com/drive/folders/1j0OT7KgJOk5hmcOVNPdcdkaekRRxHekk?usp=sharing
Second, I trained the HiFi-GAN vocoder (with the 1.46M acoustic model) for about 290k steps.
My transcript text: "xin chào tôi là phương anh bản thử số chín"
- I got this: https://drive.google.com/file/d/1UtgE1gTC8mwo1SV1b7chauvWPC7uPjxM/view?usp=sharing
  => The speaker talks nonsense, but the intonation is quite good.
- Here is 50k vocoder + 1.46M acoustic, just to compare:
  https://drive.google.com/file/d/1InQ8ykYC_P7qaKhv_58SmTC0r-b_4_0h/view?usp=sharing
- And 50k vocoder + 800k acoustic trained from scratch: https://drive.google.com/file/d/1E-FjOfBqFf9vHTKXmAUhamtB2FsAlAMT/view?usp=sharing
I'm stuck. Should I focus on the acoustic model, the vocoder, or the dataset to improve the result?
Thanks!
It seems to me that you are using the wrong lexicon file when generating speech.
The default scripts use the lexicon file assets/infore/lexicon.txt
from the InfoRE dataset; when working with your own dataset, you should replace it with your own lexicon file.
I don't think so; the lexicon file just maps words to characters. I also recorded audio with the same text as the original dataset.
In the function,
vietTTS/vietTTS/nat/data_loader.py
Line 11 in 346d467
we use the lexicon file to compute the phoneme set, and use that set to compute phoneme indices. A mismatched phoneme set between training and inference will cause problems.
We use this function at inference:
vietTTS/vietTTS/nat/text2mel.py
Line 33 in 346d467
and at training:
vietTTS/vietTTS/nat/data_loader.py
Line 56 in 346d467
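To illustrate why this matters, here is a minimal sketch (not the actual vietTTS code; the lexicon lines and helper name are hypothetical) of how a phoneme-to-index map built from a lexicon can silently shift when the lexicon changes:

```python
def phoneme_index(lexicon_lines):
    """Build a phoneme -> index map from 'word<TAB>p h o n e m e s' lines."""
    phonemes = set()
    for line in lexicon_lines:
        word, pron = line.split("\t")
        phonemes.update(pron.split())
    # Indices are assigned by sorted order, so they depend on the whole set.
    return {p: i for i, p in enumerate(sorted(phonemes))}

# Two different lexicons (hypothetical toy entries):
train_lexicon = ["xin\tx i n", "chào\tch a o"]
infer_lexicon = ["xin\tx i n", "tôi\tt ô i"]

idx_train = phoneme_index(train_lexicon)
idx_infer = phoneme_index(infer_lexicon)

# The same phoneme gets a different index in each set, so the acoustic
# model would receive wrong input IDs at inference time.
print(idx_train["x"], idx_infer["x"])
```

This is why the training lexicon and the inference lexicon must be the same file: even if both contain a given phoneme, its numeric index depends on every other entry in the set.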
So do I have to train again, or just replace my lexicon file?
My advice is to use the lexicon file that was used to train your model. Usually, it is at train_data/lexicon.txt
Generate speech with:

```sh
python3 -m vietTTS.synthesizer \
  --lexicon-file=train_data/lexicon.txt \
  --text="hôm qua em tới trường" \
  --output=clip.wav
```
I'm confused, because during training I used your lexicon file (because my audio matches the original). I already changed it as you instructed, but the result is the same.
@Lethanhson9901, can you show a few lines in your train_data/lexicon.txt
and an example *.textgrid file?
I suspect that there is a mismatch somewhere, as your loss and mel-spectrogram look alright to me.
@Lethanhson9901 Everything seems alright to me. I don't think I can help much in this case.
@Lethanhson9901
My advice:
- Train the duration model (10 minutes) and acoustic model (50 minutes) on the InfoRE dataset and generate speech to make sure everything is working correctly (even though the speech won't be good, it should be understandable).
- Train the duration model (10 minutes) and acoustic model (50 minutes) on your dataset and generate speech (using your pretrained HiFi-GAN). The speech should be understandable.
I'll try. Many thanks!
Hey, thank God, you're right. It works!
By the way, if I want the voice to be more natural, which part of training should I focus on? Or should I focus on data pre-processing (denoising, etc.)? And should I use audio augmentation in training? (So far I don't think it helps.)
@Lethanhson9901
I'm not sure data augmentation can help. There are a few things that can help:
- A better dataset:
  - Better voice recordings: clear voice, correct pronunciation.
  - Higher sample rate: 16k -> 22k -> 24k -> 48k.
  - More data: 20 hours -> 40 hours -> 100 hours.
  - Better text transcripts: cover all Vietnamese phonemes with approximately equal phoneme frequencies.
  - Clean the dataset: make sure text and speech are perfectly matched.
- A bigger model:
  - If you have more data, use decoder RNNs with 1024 units (currently, it is 512 on the InfoRE dataset).
- Train the acoustic model and the HiFi-GAN model longer.
- Fine-tune the HiFi-GAN model with your acoustic model.
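As a rough sense of what "use decoder RNNs with 1024 units" costs, here is a back-of-the-envelope parameter count for a single GRU layer (this is illustrative arithmetic, not the actual vietTTS model definition):

```python
def gru_params(input_size, hidden_size):
    """Parameter count of one GRU layer: 3 gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 3 * (input_size * hidden_size
                + hidden_size * hidden_size
                + hidden_size)

small = gru_params(512, 512)    # hidden size used on InfoRE
big = gru_params(512, 1024)     # suggested size for larger datasets
print(small, big)
# Doubling the hidden size roughly triples this layer's parameters
# (the recurrent h*h term alone quadruples), which is why more data
# is recommended before growing the decoder.
```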
Hi @NTT123,
Could you please give me instructions on how to change the decoder RNNs to 1024 units for my dataset?
@nampdn You will need to modify the file vietTTS/nat/config.py
by setting acoustic_decoder_dim=1024
Fantastic, thank you!