thorstenMueller/Thorsten-Voice

Forward Tacotron Model

Closed this issue · 18 comments

Hi Thorsten,

first of all, thanks for the nice dataset. Out of curiosity, I trained a FastPitch model on it using my repo: https://github.com/as-ideas/ForwardTacotron

Here is a sample vocoded with the pretrained universal HiFi-GAN model from here.

FastPitch Sample

Coqui Sample for comparison.

Let me know if you are interested in the model, I can make it public and probably fine tune the HiFi-GAN model as well.

Hi @cschaefer26

I'm really impressed, as I think it sounds great. 👍 👍

Do you have a website, post, or article I can link to? Adding a "Thorsten" phrase/sample here or here would really be great.

Did you train on the "stable" dataset or on the "recording-in-progress" dataset with 8k recordings?

It would be really nice if you could make your model/config public!

Thanks for your effort and sharing. 😄

Hi, glad you like it. I trained on the stable dataset. I just read that the 8k recordings are of higher quality? I'll try those out. I will play around a bit with the settings to see if I can increase the quality and share the best model.

Hi, so I trained a new model on the 8k dataset. My impression is that the prosody is more consistent, leading to a quicker build-up of attention for the Tacotron model (which I need for extracting durations for the FastPitch model; see the sketch after the samples). Here are samples for comparison (vocoded with the universal HiFi-GAN):

fastpitch old dataset
fastpitch new dataset
coqui
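
For context, one common way durations are read off a trained Tacotron's attention matrix is to count how many decoder frames attend most strongly to each input symbol; a minimal numpy sketch (the attention dump file is hypothetical):

import numpy as np

# attention has shape (decoder_steps, encoder_steps); 'attention.npy' is a
# hypothetical dump of the trained Tacotron's alignment for one utterance.
attention = np.load('attention.npy')
assigned = attention.argmax(axis=1)   # winning input symbol per decoder frame
durations = np.bincount(assigned, minlength=attention.shape[1])
print(durations)                      # number of frames per input symbol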

Here is the model in case you want to play around with it: link

I also linked it in the repo: https://github.com/as-ideas/ForwardTacotron#pretrained-models (follow the README if you want to use the model)

Thanks again @cschaefer26 for playing around with my recording-in-progress dataset and adding it to the README.
As I am a little bit "betriebstaub" (too accustomed to my own voice to judge it objectively), what is your opinion on the naturalness of the speech flow in these two versions?

I'd say the new dataset gives a more natural flow, but the synthesized audio has the same length as the version from the old dataset.
[Screenshot from 2021-10-01 showing both synthesized audio files]

My first recordings were spoken too fast (23 chars/sec); now I aim for an average of around 16 characters per second.
See my video here: https://youtu.be/mlsYnDw71vc?t=523
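
A quick way to check the pace per clip (a sketch, assuming an LJSpeech-style metadata.csv next to a wavs/ folder; soundfile and the flagging threshold are assumptions):

import soundfile as sf

# Sketch: compute characters per second for each clip in an LJSpeech-style
# dataset (metadata.csv with 'id|text' lines, audio in wavs/<id>.wav).
with open('metadata.csv', encoding='utf-8') as f:
    for line in f:
        file_id, text = line.strip().split('|')[:2]
        info = sf.info(f'wavs/{file_id}.wav')
        cps = len(text) / info.duration
        if cps > 20:  # flag clips spoken much faster than the ~16 chars/sec target
            print(file_id, round(cps, 1))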

Hi, I didn't manually check the whole dataset, so it is more of a guess. Here is a ground-truth example of a speed fluctuation (the word 'Solidarität' is quicker than the rest):

https://drive.google.com/file/d/1eEAyT91LLQLBc-5OtSJ4IHA8soBNQVrz/view?usp=sharing

In my experience, it's best for the ML training if the speed is fairly constant; it's probably some kind of double-edged sword to be consistent but not robotic...

I know what you mean. Speaking at a very consistent pace can sound bored, while speaking too enthusiastically might be unusable for TTS training. Finding the right balance is not easy.

Absolutely, maybe the TTS models need to get better :-)

One more thing: I also found that correct phonemization plays a huge role in TTS prosody. I might retrain the model with a different phonemizer (https://github.com/as-ideas/DeepPhonemizer). The espeak phonemizer sucks in German.
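
For reference, usage follows the DeepPhonemizer README; a minimal sketch (the checkpoint path is hypothetical, and 'de' assumes a checkpoint trained with German support):

from dp.phonemizer import Phonemizer

# Load a trained DeepPhonemizer checkpoint and phonemize a German sentence.
phonemizer = Phonemizer.from_checkpoint('de_phonemizer.pt')
print(phonemizer('Eine Sprecherin mit natürlicher Betonung.', lang='de'))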

Again, thanks for your recordings; it's a nice dataset to do some TTS research. I may get into emotional TTS on it soon.

Hi @cschaefer26 👋
I've trained a "forward_tacotron" model to the full 300k steps with your repo, and with the universal HiFi-GAN vocoder the quality is really good. But the speech is way too fast, or the breaks between words and between sentences are too short. Do you have any ideas on how to improve this?

Just in case, I'll remove the fastest recordings from the dataset and start a new training.

https://sndup.net/39xm

Hi, yeah, I have encountered that problem before with ForwardTacotron on some datasets. It seems to me that the duration predictor is somehow overfitting. You could either take an earlier model (e.g. 50k steps, which should be fine) or train a model with model_type: 'fast_pitch' (in the config); the FastPitch models haven't shown this behaviour yet.

An earlier checkpoint could be worth a try. A FastPitch training is running already.

Here's a sample from a FastPitch model (300k) and HiFi-GAN. It's too slow and unnatural.
A speed between FastPitch and ForwardTacotron would be nice :-).

https://sndup.net/2q9j

Hi, yeah, that's too slow and the other one is way too fast. Have you tried earlier models? E.g. after 50k steps the quality is usually on par with 300k steps. Also, did you produce the audio from a single input? The quality is better when providing shorter sentences (and later concatenating the wavs). You can actually adjust the speed with the alpha param (larger alpha is faster):

python gen_forward.py --alpha 1.2 --input_text 'this is whatever you want it to be' hifigan

I tried some model variations and found the following setup quite useful (example call after the list):

  • Forward Tacotron configuration
  • Model checkpoint 300k
  • alpha 0.8
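
With those settings, the call from above becomes (the input text is just an example):

python gen_forward.py --alpha 0.8 --input_text 'Hallo, das ist ein Test.' hifigan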

Here's a sample:
https://sndup.net/6py7

I've added the audio samples based on your repo to my comparison page (including a link to your repo, obviously):
https://twitter.com/ThorstenVoice/status/1454537933558620174?s=20

Cool, sounds quite good with alpha=0.8. It seems to me though that you synthesized the full thing at once? I really recommend splitting longer texts into single sentences, as the model has been trained on sentences (especially the FastPitch models have trouble with longer sentences, as they have to distribute their attention more).
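
A minimal sketch of that workflow (the per-sentence wavs are assumed to exist already, e.g. generated one by one with gen_forward.py; file names are hypothetical, numpy/soundfile assumed):

import re

import numpy as np
import soundfile as sf

text = 'Erster Satz. Zweiter Satz! Und noch ein dritter Satz?'
# Split on sentence-final punctuation; each sentence would be synthesized on
# its own (e.g. with gen_forward.py) and written to sent_<i>.wav - the file
# names here are hypothetical.
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# Stitch the per-sentence wavs back together with a short pause in between.
chunks = []
sample_rate = None
for i, _ in enumerate(sentences):
    wav, sample_rate = sf.read(f'sent_{i}.wav')
    chunks.append(wav)
    chunks.append(np.zeros(int(0.3 * sample_rate)))  # ~300 ms of silence
sf.write('full.wav', np.concatenate(chunks), sample_rate)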

You're right. I'll split longer texts, but in general I like the quality of the synthesized voice.

@cschaefer26 I've uploaded my model files and will share the link via Twitter. Do you have a Twitter account I should link to, or just your repo?

Hi, thanks for sharing the models and glad you like the synth. I don't have a Twitter account, if you like you could link the repo :-)

PS: you could use the repo to gain some insight into your dataset. You can load the attention score dictionary, which provides a score for each file id measuring how sharp the Tacotron attention was. Low scores correspond to a mismatch between text and audio file. Copy-paste this into the main ForwardTacotron directory and execute it:

from utils.files import unpickle_binary

if __name__ == '__main__':
    # Maps each file id to a tuple whose second entry is the attention score.
    att_dict = unpickle_binary('data/att_score_dict.pkl')
    id_score = [(k, v[1]) for k, v in att_dict.items()]
    # Sort descending by score, so the most suspicious files print last.
    id_score.sort(key=lambda x: -x[1])
    for file_id, score in id_score:
        print(file_id, score)
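
A possible follow-up (a sketch; the 0.5 threshold is just a guess to tune, and an LJSpeech-style metadata.csv is assumed): drop the lowest-scoring ids before retraining.

from utils.files import unpickle_binary

# Exclude files whose attention score falls below the guessed threshold.
att_dict = unpickle_binary('data/att_score_dict.pkl')
bad_ids = {k for k, v in att_dict.items() if v[1] < 0.5}

with open('metadata.csv', encoding='utf-8') as f_in, \
        open('metadata_clean.csv', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        if line.split('|')[0] not in bad_ids:
            f_out.write(line)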