voice conversion
p0p4k commented
I tried it, but the output is not satisfactory (the voice doesn't change much).
Am I doing anything wrong? Thanks.
def voice_conversion(self, spec, spec_length, ying, ying_length, g_src, g_tgt):
    # Encode spectrogram and YIN pitch features with the source speaker embedding.
    z_spec, m_spec, logs_spec, spec_mask = self.net_g.enc_spec(spec, spec_length, g=g_src)
    z_yin, m_yin, logs_yin, yin_mask = self.net_g.enc_pitch(ying, ying_length, g=g_src)
    z_yin_crop, logs_yin_crop, m_yin_crop = self.net_g.crop_scope(
        [z_yin, logs_yin, m_yin], scope_shift=0)
    z = torch.cat([z_spec, z_yin], dim=1)
    y_mask = spec_mask
    # Map the source latent to the prior, then back with the target speaker embedding.
    z_p = self.flow(z, y_mask, g=g_src)
    z_hat = self.flow(z_p, y_mask, g=g_tgt, reverse=True)
    z_spec, z_yin = torch.split(z, self.inter_channels - self.yin_channels, dim=1)
    z_yin_crop = self.crop_scope([z_yin], 0)[0]
    z_crop = torch.cat([z_spec, z_yin_crop], dim=1)
    decoder_inputs = z_crop * y_mask
    o_hat = self.dec(decoder_inputs, g=g_tgt)
    return o_hat, y_mask, (z, z_p, z_hat)
meriamOu commented
I think you are not feeding the flow output into the decoder; maybe it should be decoder_inputs = z_hat * y_mask?
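A minimal sketch of how the tail of your function could look along those lines, reusing only the names from your snippet (whether this matches the upstream implementation is an assumption):

# Split the flow output z_hat (not z) so the decoder sees the
# target-speaker latent instead of the source latent.
z_spec_hat, z_yin_hat = torch.split(
    z_hat, self.inter_channels - self.yin_channels, dim=1)
z_yin_hat_crop = self.crop_scope([z_yin_hat], 0)[0]
z_crop = torch.cat([z_spec_hat, z_yin_hat_crop], dim=1)
decoder_inputs = z_crop * y_mask
o_hat = self.dec(decoder_inputs, g=g_tgt)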
anonymous-pits commented
For better output, I recommend three things.
- You need to find the text alignment to sample the aligned prior, as in the algorithm. To find the alignment, you need to provide the text as a condition.
- You need to fit the target speaker's mean pitch. For male-to-female conversion, for example, you need to find an optimal scope-shift s and shift by that value instead of zero, as in self.crop_scope([z_yin], 0)[0]. That is why the scope-shift s is mentioned in the algorithm; see the first sketch after this list.
- You need the iteration in the algorithm; it provides more stable output (see the second sketch below).
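As a rough illustration of the scope-shift point, here is a hypothetical way to estimate s from the pitch-channel profiles of a source latent and a target-speaker reference latent. The helper name and the softmax-profile heuristic are assumptions for illustration, not code from this repo; it only assumes the yin channel axis is ordered by pitch, which is what shifting the crop window relies on:

import torch

def estimate_scope_shift(z_yin_src, z_yin_tgt_ref):
    # Hypothetical helper: pick a scope-shift from the mean
    # pitch-channel gap between source and target speakers.
    # z_yin_*: (batch, yin_channels, time) latents from enc_pitch.
    bins = torch.arange(z_yin_src.size(1), device=z_yin_src.device,
                        dtype=z_yin_src.dtype)
    # Treat the softmax over channels as a soft pitch histogram.
    w_src = z_yin_src.softmax(dim=1).mean(dim=(0, 2))
    w_tgt = z_yin_tgt_ref.softmax(dim=1).mean(dim=(0, 2))
    return int(torch.round((bins * w_tgt).sum() - (bins * w_src).sum()))

The returned value would then replace the hard-coded 0 passed to crop_scope.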
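And a minimal sketch of the iteration idea, assuming the loop re-runs the conversion on its own output; extract_spec and extract_ying are placeholder feature extractors (not functions from this repo), n_iter is an assumption, and the feature lengths are assumed unchanged across iterations:

# Hypothetical outer loop: feed the converted waveform back through
# feature extraction and convert again for a more stable result.
for _ in range(n_iter):
    o_hat, y_mask, _ = self.voice_conversion(
        spec, spec_length, ying, ying_length, g_src, g_tgt)
    spec = extract_spec(o_hat)   # placeholder spectrogram extraction
    ying = extract_ying(o_hat)   # placeholder YIN/yingram extraction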