Style tokens as guide rather than 1:1 transfer
Thanks again for the great work on Mellotron.
The usual implementations of Global Style Tokens allow for transfer of style without locking the target inference to a 1:1 rhythm transfer.
For example, a 1-minute reference audio with Mellotron appears to produce roughly 1 minute of output regardless of the text input, whereas other GST implementations transfer the style without locking the output to the reference's rhythm/duration 1:1, e.g. synthesizing a 5-second sentence from a 1-minute reference while still keeping the 'style' of the reference audio.
Is there any change to the model to enable such a scenario?
Regarding the GST part of Mellotron, there is no 1:1 lock. You can use GST the same way as in other repos.
If you do inference with the Mellotron model, however, we additionally extract two things from the reference audio: the rhythm and the pitch. It's the rhythm that creates the 1:1 correspondence, but the automatically extracted pitch may not make sense if you do not also condition on the rhythm.
If you don't want rhythm conditioning (which you can disable by using model.inference()) or pitch conditioning (which you can disable by sending zeros as the pitch), you essentially get Tacotron 2 with GST and speaker ids.
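For illustration, here is a minimal sketch of that GST-only mode. It assumes `model` is a Mellotron Tacotron2 instance already loaded from a checkpoint (loading code omitted) and that `model.inference()` takes a `(text, style_input, speaker_ids, f0s)` tuple as in the repo's inference notebook; treat the exact tuple ordering, tensor shapes, and variable names as assumptions and check `model.py` in the repo.

```python
import torch

# Dummy inputs -- shapes are placeholders, not taken from the repo:
text_encoded = torch.LongTensor([[12, 5, 38, 7]]).cuda()   # encoded text symbols
style_mel = torch.zeros(1, 80, 400).cuda()                  # reference mel used only for GST
speaker_id = torch.LongTensor([0]).cuda()                   # target speaker id
pitch_contour = torch.zeros(1, 1, 400).cuda()               # zeros => pitch conditioning disabled

with torch.no_grad():
    # model.inference() (unlike the rhythm-conditioned path) takes no attention
    # map from the reference, so there is no 1:1 duration lock; with a zeroed
    # f0 contour this is essentially Tacotron 2 + GST + speaker id.
    mel_out, mel_out_postnet, gate_out, alignments = model.inference(
        (text_encoded, style_mel, speaker_id, pitch_contour))
```

The output length is then driven by the text and the gate predictions rather than by the reference audio's duration.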
Thank you @blisc for the quick reply - much appreciated!
thanks @blisc
@blisc I have a question along similar lines. I trained the model with this repo on LJ Speech. During inference I use an out-of-dataset file as the style file. The synthesized voice changes quite a bit: the quality is decent, but it no longer sounds like the original LJ Speech speaker. How can I fix that? Please can you help. Thanks.