Style tokens as guide rather than 1:1 transfer
Thanks again for the great work on Mellotron.
The usual implementations of Global Style Tokens allow for transfer of style without locking the target inference to a 1:1 rhythm transfer.
For example, a 1-minute reference audio with Mellotron appears to produce roughly 1 minute of output regardless of the text input, whereas other GST implementations transfer the style without locking the output to the reference's rhythm/duration 1:1, e.g. synthesizing a 5-second sentence from a 1-minute reference while still keeping the 'style' of the reference audio.
Is there any change to the model to enable such a scenario?
Regarding the GST part of Mellotron, there is no 1:1 lock. You can use GST the same way as in other repos.
If you do inference with the Mellotron model, however, we additionally extract two things from the reference audio: the rhythm and the pitch. It's the rhythm that creates the 1:1 correspondence, but the automatically extracted pitch may not make sense if you do not also condition on the rhythm.
If you don't want rhythm conditioning (which you can disable by using model.inference()) or pitch conditioning (which you can disable by sending zeros as the pitch), you essentially get Tacotron 2 with GST and speaker ids.
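For illustration, here is a minimal sketch of that GST-only mode. It assumes `model` is a Mellotron Tacotron2 instance already loaded from a checkpoint (loading code omitted) and that `model.inference()` takes a `(text, style_input, speaker_ids, f0s)` tuple as in the repo's inference notebook; treat the exact tuple ordering, tensor shapes, and variable names as assumptions and check `model.py` in the repo.

```python
import torch

# Dummy inputs -- shapes are placeholders, not taken from the repo:
text_encoded = torch.LongTensor([[12, 5, 38, 7]]).cuda()   # encoded text symbols
style_mel = torch.zeros(1, 80, 400).cuda()                  # reference mel used only for GST
speaker_id = torch.LongTensor([0]).cuda()                   # target speaker id
pitch_contour = torch.zeros(1, 1, 400).cuda()               # zeros => pitch conditioning disabled

with torch.no_grad():
    # model.inference() (unlike the rhythm-conditioned path) takes no attention
    # map from the reference, so there is no 1:1 duration lock; with a zeroed
    # f0 contour this is essentially Tacotron 2 + GST + speaker id.
    mel_out, mel_out_postnet, gate_out, alignments = model.inference(
        (text_encoded, style_mel, speaker_id, pitch_contour))
```

The output length is then driven by the text and the gate predictions rather than by the reference audio's duration.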
Thank you @blisc for the quick reply - much appreciated!
thanks @blisc
@blisc I have a question along similar lines. I trained the model with this repo on LJ Speech. During inference I use an out-of-dataset file as the style file. The synthesized voice changes quite a bit: the quality is decent, but it no longer sounds like the original LJ Speech speaker. How can I fix that? Please can you help. Thanks.