anonymous-pits/pits

voice conversion


p0p4k commented

I tried, but the output is not satisfactory (the voice doesn't change much).
Am I doing something wrong? Thanks.

```python
def voice_conversion(self, spec, spec_length, ying, ying_length, g_src, g_tgt):
    # Encode the spectrogram and yingram with the source speaker embedding
    z_spec, m_spec, logs_spec, spec_mask = self.net_g.enc_spec(spec, spec_length, g=g_src)
    z_yin, m_yin, logs_yin, yin_mask = self.net_g.enc_pitch(ying, ying_length, g=g_src)
    z_yin_crop, logs_yin_crop, m_yin_crop = self.net_g.crop_scope(
        [z_yin, logs_yin, m_yin], scope_shift=0)
    z = torch.cat([z_spec, z_yin], dim=1)
    y_mask = spec_mask

    # Source latent -> flow prior -> inverse flow with the target embedding
    z_p = self.flow(z, y_mask, g=g_src)
    z_hat = self.flow(z_p, y_mask, g=g_tgt, reverse=True)

    # Split, crop, and decode
    z_spec, z_yin = torch.split(z, self.inter_channels - self.yin_channels, dim=1)
    z_yin_crop = self.crop_scope([z_yin], 0)[0]
    z_crop = torch.cat([z_spec, z_yin_crop], dim=1)
    decoder_inputs = z_crop * y_mask

    o_hat = self.dec(decoder_inputs, g=g_tgt)

    return o_hat, y_mask, (z, z_p, z_hat)
```

I think you are not feeding the flow output into the decoder. Maybe `decoder_inputs = z_hat * y_mask`?
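
As a minimal sketch of one reading of that fix (keeping the crop step from the snippet above, rather than feeding `z_hat` to the decoder directly), the tail of `voice_conversion` would split and crop the converted latent `z_hat` instead of `z`:

```python
# Sketch only: route the flow output z_hat, not z, to the decoder.
z_spec_hat, z_yin_hat = torch.split(z_hat, self.inter_channels - self.yin_channels, dim=1)
z_yin_crop = self.crop_scope([z_yin_hat], 0)[0]  # scope_shift=0; see point 2 below
z_crop = torch.cat([z_spec_hat, z_yin_crop], dim=1)
decoder_inputs = z_crop * y_mask
o_hat = self.dec(decoder_inputs, g=g_tgt)
```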

For better output, I recommend three things.

  1. You need to find the text alignment to sample the aligned prior, as in the algorithm; you need to provide the text condition to find it.
  2. You need to fit the target speaker's mean pitch. For male-to-female conversion, for example, you need to find an optimal scope shift `s` and shift by more than the zero in `self.crop_scope([z_yin], 0)[0]`. That is why the scope shift `s` appears in the algorithm (see the sketch after this list).
  3. You need the iteration in the algorithm; it gives more stable output (illustrated by the loop in the sketch below).
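
A rough sketch of points 2 and 3 under stated assumptions: the yingram axis is treated as linear in semitones with an assumed `bins_per_semitone` resolution, and `extract_features` is a hypothetical helper that maps a waveform back to `(spec, spec_length, ying, ying_length)`; neither is part of this repository.

```python
import math

# Hypothetical helper for point 2: estimate the scope shift from the two
# speakers' mean F0. `bins_per_semitone` is an assumed yingram resolution,
# not a value taken from this repository.
def estimate_scope_shift(src_f0_hz, tgt_f0_hz, bins_per_semitone=1):
    semitones = 12.0 * math.log2(tgt_f0_hz / src_f0_hz)  # pitch gap in semitones
    return round(semitones * bins_per_semitone)          # gap in yingram bins

# Hypothetical loop for point 3: repeat the conversion, re-extracting
# features from each intermediate output before the next pass.
def iterative_conversion(model, wav, g_src, g_tgt, extract_features, n_iters=3):
    for _ in range(n_iters):
        spec, spec_length, ying, ying_length = extract_features(wav)
        wav, _, _ = model.voice_conversion(
            spec, spec_length, ying, ying_length, g_src, g_tgt)
    return wav
```

The estimated shift would replace the hard-coded `0` in `self.crop_scope([z_yin], 0)[0]`.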