anonymous-pits/pits

voice conversion


p0p4k commented

I tried, but the output is not satisfactory (the voice doesn't change much).
Am I doing something wrong? Thanks.

```python
def voice_conversion(self, spec, spec_length, ying, ying_length, g_src, g_tgt):
    # Encode the spectrogram and yingram with the source speaker embedding
    z_spec, m_spec, logs_spec, spec_mask = self.net_g.enc_spec(spec, spec_length, g=g_src)
    z_yin, m_yin, logs_yin, yin_mask = self.net_g.enc_pitch(ying, ying_length, g=g_src)
    z_yin_crop, logs_yin_crop, m_yin_crop = self.net_g.crop_scope(
        [z_yin, logs_yin, m_yin], scope_shift=0)
    z = torch.cat([z_spec, z_yin], dim=1)
    y_mask = spec_mask

    # Source latent -> flow prior -> inverse flow with the target embedding
    z_p = self.flow(z, y_mask, g=g_src)
    z_hat = self.flow(z_p, y_mask, g=g_tgt, reverse=True)

    # Split, crop, and decode
    z_spec, z_yin = torch.split(z, self.inter_channels - self.yin_channels, dim=1)
    z_yin_crop = self.crop_scope([z_yin], 0)[0]
    z_crop = torch.cat([z_spec, z_yin_crop], dim=1)
    decoder_inputs = z_crop * y_mask

    o_hat = self.dec(decoder_inputs, g=g_tgt)

    return o_hat, y_mask, (z, z_p, z_hat)
```

I think you are not feeding the flow output into the decoder. Maybe `decoder_inputs = z_hat * y_mask`?
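
As a minimal sketch of one reading of that fix (keeping the crop step from the snippet above, rather than feeding `z_hat` to the decoder directly), the tail of `voice_conversion` would split and crop the converted latent `z_hat` instead of `z`:

```python
# Sketch only: route the flow output z_hat, not z, to the decoder.
z_spec_hat, z_yin_hat = torch.split(z_hat, self.inter_channels - self.yin_channels, dim=1)
z_yin_crop = self.crop_scope([z_yin_hat], 0)[0]  # scope_shift=0; see point 2 below
z_crop = torch.cat([z_spec_hat, z_yin_crop], dim=1)
decoder_inputs = z_crop * y_mask
o_hat = self.dec(decoder_inputs, g=g_tgt)
```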

For better output, I recommend three things.

  1. You need to find the text alignment to sample the aligned prior, as in the algorithm; you need to provide the text condition to find it.
  2. You need to fit the target speaker's mean pitch. For male-to-female conversion, for example, you need to find an optimal scope shift `s` and shift by more than the zero in `self.crop_scope([z_yin], 0)[0]`. That is why the scope shift `s` appears in the algorithm (see the sketch after this list).
  3. You need the iteration in the algorithm; it gives more stable output (illustrated by the loop in the sketch below).
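
A rough sketch of points 2 and 3 under stated assumptions: the yingram axis is treated as linear in semitones with an assumed `bins_per_semitone` resolution, and `extract_features` is a hypothetical helper that maps a waveform back to `(spec, spec_length, ying, ying_length)`; neither is part of this repository.

```python
import math

# Hypothetical helper for point 2: estimate the scope shift from the two
# speakers' mean F0. `bins_per_semitone` is an assumed yingram resolution,
# not a value taken from this repository.
def estimate_scope_shift(src_f0_hz, tgt_f0_hz, bins_per_semitone=1):
    semitones = 12.0 * math.log2(tgt_f0_hz / src_f0_hz)  # pitch gap in semitones
    return round(semitones * bins_per_semitone)          # gap in yingram bins

# Hypothetical loop for point 3: repeat the conversion, re-extracting
# features from each intermediate output before the next pass.
def iterative_conversion(model, wav, g_src, g_tgt, extract_features, n_iters=3):
    for _ in range(n_iters):
        spec, spec_length, ying, ying_length = extract_features(wav)
        wav, _, _ = model.voice_conversion(
            spec, spec_length, ying, ying_length, g_src, g_tgt)
    return wav
```

The estimated shift would replace the hard-coded `0` in `self.crop_scope([z_yin], 0)[0]`.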