NVIDIA/flowtron

Output length is fixed?

andi-808 opened this issue · 5 comments

Hello,

I am in the process of training from a pre-trained model (no success so far). Running inference on some of the models I’ve trained does produce sound. However, no matter how long the sentence I want the model to speak is, the resulting clip is always 5 seconds long (410 KB).

Is there something I’m missing? Is it because my models are currently garbage and won’t produce the correct output until good attention is achieved? The voice tone/timbre sounds correct, although the speech itself is gibberish.
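One possible explanation for the constant 5 seconds / 410 KB is that inference generates a fixed budget of mel frames, and the clip only gets shorter once the model has learned when to stop. A rough sanity check follows, assuming 400 frames, a hop of 256 samples, 22050 Hz audio, and 32-bit samples. All four values are assumptions for illustration, not settings confirmed anywhere in this thread.

```python
# All constants are assumptions for illustration, not values read from the repository.
N_FRAMES = 400          # assumed fixed number of mel frames generated per inference call
HOP_LENGTH = 256        # assumed number of audio samples per mel frame
SAMPLING_RATE = 22050   # assumed sampling rate in Hz
BYTES_PER_SAMPLE = 4    # assumed 32-bit audio samples


def clip_stats(n_frames, hop_length=HOP_LENGTH,
               sampling_rate=SAMPLING_RATE, bytes_per_sample=BYTES_PER_SAMPLE):
    """Return (sample count, duration in seconds, raw audio size in bytes)."""
    n_samples = n_frames * hop_length
    seconds = n_samples / sampling_rate
    n_bytes = n_samples * bytes_per_sample  # audio payload only, file header excluded
    return n_samples, seconds, n_bytes


n_samples, seconds, n_bytes = clip_stats(N_FRAMES)
print(f"{n_samples} samples, {seconds:.2f} s, {n_bytes / 1000:.0f} KB")
# prints: 102400 samples, 4.64 s, 410 KB
```

Under these assumptions the numbers line up with a clip of roughly 5 seconds and 410 KB. That would suggest the length comes from a frame budget rather than from the input text.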

OK, so the file size does change now that I’ve properly trained with my data. It took a while, but I’m getting speech out with my dataset. However, there is still a maximum limit on the length of the utterance.

Should I warm-start with “n_text” set to a larger value? I tried this, but got an error before training even started.

I managed to get the length to vary during inference.

Can you share how you did that?

Hey, the output length stayed fixed for as long as the model was still making progress in training. At the point where it looked like it was overfitting, I stopped and reduced the learning rate. Once I managed to find the minimum, it started producing actual intelligible words. As training got closer to that minimum, the output length began to vary, getting closer and closer to the right length the more the model learned.
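The manual procedure described above (watch for the loss to stop improving, then lower the learning rate) can also be automated. Here is a minimal sketch in plain Python. The class name, defaults, and loss values are hypothetical and not part of Flowtron.

```python
class PlateauLRReducer:
    """Hypothetical helper: lowers the learning rate when validation loss stalls."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = lr                # current learning rate
        self.factor = factor        # multiplier applied on each reduction
        self.patience = patience    # evaluations without improvement before reducing
        self.min_lr = min_lr        # lower bound for the learning rate
        self.best = float("inf")    # best validation loss seen so far
        self.bad_evals = 0          # consecutive evaluations without improvement

    def step(self, val_loss):
        """Record one validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_evals = 0
        return self.lr


# Made-up validation losses: improvement, then two stalls.
reducer = PlateauLRReducer(lr=1e-4, factor=0.5, patience=2)
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69, 0.70, 0.70]
history = [reducer.step(loss) for loss in losses]
print(history)
```

With these made-up losses the rate is halved twice, once after each pair of evaluations that fails to beat the best loss. PyTorch ships a scheduler built on the same idea, `torch.optim.lr_scheduler.ReduceLROnPlateau`, which may be more convenient inside a real training loop.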

Thanks for your reply and explanation. Just to clarify: do you mean that continuing to train with a reduced learning rate until the model reaches a local minimum will increase the length of the pronounceable output?