NVIDIA/mellotron

Inference without rhythm and pitch

kngan43 opened this issue · 0 comments

Hi,

I'm new to speech synthesis. I've trained a model on the EmoV-DB dataset and want to run inference using the GST part of Mellotron. My goal is to input arbitrary text and get output speech with a specific emotion.

I noticed in issue #20 that someone mentioned that the rhythm and pitch create a 1:1 alignment. Can someone explain in more detail how to do inference without rhythm and pitch?