NVIDIA/mellotron

Synthesizing own text without style transfer gives poor audio results

ocesp98 opened this issue · 1 comment

When trying to synthesize my own text with the pretrained Mellotron and WaveGlow models, I get poor audio quality (a very croaky voice).
I call the inference method so that no style transfer is performed; however, I am also not sure what to pass as input_style and f0s.
The following code just synthesizes with speaker id 0 of the pretrained model. Is it normal that the audio quality is relatively poor? My end goal is to fine-tune this model on a speech dataset in another language with two speakers.

import torch
import IPython.display as ipd
from text import text_to_sequence

# mellotron, waveglow, denoiser, hparams and arpabet_dict are loaded as in inference.ipynb.

text = "This is an example sentence."
# Encode the text (with ARPAbet lookups) and add a batch dimension.
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()

# All-zero pitch contour and the first speaker of the pretrained model.
f0 = torch.zeros([1, 1, 32]).cuda()
speaker_id = torch.LongTensor([0]).cuda()

# Passing 0 as the style input selects GST token 0 instead of encoding a reference mel.
with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, alignments = mellotron.inference(
        (text_encoded, 0, speaker_id, f0))

# Vocode with WaveGlow, then denoise.
with torch.no_grad():
    audio = denoiser(waveglow.infer(mel_outputs_postnet, sigma=0.7), 0.01)[:, 0]
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)
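
On the f0s question: my guess (I am not certain this is the intended usage) is that the all-zero f0 tensor marks every frame as unvoiced, which might itself explain the croakiness. Below is a rough, untested sketch of extracting a real pitch contour from a reference recording with the repo's yin.compute_yin, mirroring what data_utils.TextMelLoader.get_f0 does; 'reference.wav' is a placeholder for any utterance recorded at hparams.sampling_rate.

import numpy as np
import torch
from scipy.io.wavfile import read
from yin import compute_yin  # YIN pitch tracker bundled with Mellotron

# Load and normalize a reference recording ('reference.wav' is a placeholder).
sampling_rate, audio = read('reference.wav')
audio = audio.astype(np.float32) / hparams.max_wav_value

# Frame-level pitch estimates, as in data_utils.TextMelLoader.get_f0.
f0_list, harmonic_rates, argmins, times = compute_yin(
    audio, sampling_rate, hparams.filter_length, hparams.hop_length,
    hparams.f0_min, hparams.f0_max, hparams.harm_thresh)

# Pad so the contour lines up with mel frames, then shape to (1, 1, T).
pad = int((hparams.filter_length / hparams.hop_length) / 2)
f0_list = [0.0] * pad + f0_list + [0.0] * pad
f0 = torch.from_numpy(np.array(f0_list, dtype=np.float32))[None, None].cuda()

Even with a real contour, I am not sure how it should be aligned or truncated against the decoder's output length when there is no reference mel, which is part of what I am asking.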

Just upvoting to say I had the same problem, so that's +1 for the "this might be normal" vote.