auspicious3000/contentvec

HuggingFace Version

leng-yue opened this issue · 14 comments

Thanks to the author for this fantastic model, which nearly eliminates the timbre leakage problem.
For anyone who doesn't want to use fairseq, I made a HuggingFace version of the ContentVec best legacy checkpoint: Link.

@leng-yue Thank you for your contribution. Based on the model definition in our paper, it appears that removing the final projection is necessary to achieve the desired outcome.

Appreciate your quick update ⚡️
Although I recommend removing the final_proj call in the sample code, I decided to keep the layer in the model for backward compatibility. Several existing projects, such as so-vits-svc, ddsp-svc, and fish-diffusion, have been using final_proj for the past month or two. 😂

Surprisingly, adding the final projection did not cause any problems, even though there are two reasons to expect it would.
First, the final projection is trained on the output of the final layer, not the 9th layer. Applying it to the 9th layer's output may cause mismatches.
Second, the final projection is injected with speaker information. Since the purpose of ContentVec is to remove speaker information, adding the final projection may defeat that purpose.

To my mind, this is probably because they use the final projection with the 9th layer, which may not be in the same latent space. In this case, the final projection may not be able to inject speaker information...

I believe removing the final projection may improve the performance of those projects.

Anyway, for those who would like to use this HuggingFace interface, please remove the final projection to get the correct output.

Currently, the final_proj is just left there, and it's not called by default (in the forward).
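
For anyone following along, here is a minimal sketch of that arrangement: a HuBERT subclass that keeps `final_proj` as an attribute for compatibility but never calls it in `forward`, so the 9th-layer features come out unprojected. The class name `HubertWithUnusedFinalProj` and the tiny, randomly initialized config are assumptions made up for illustration so the snippet runs without downloading anything; with the real converted checkpoint you would load the weights with `from_pretrained` instead. It is not necessarily how the linked model is implemented.

```python
import torch
from torch import nn
from transformers import HubertConfig, HubertModel


class HubertWithUnusedFinalProj(HubertModel):
    """HuBERT encoder that stores final_proj but never calls it in forward()."""

    def __init__(self, config):
        super().__init__(config)
        # Kept only so existing checkpoints and downstream code still find the layer.
        self.final_proj = nn.Linear(config.hidden_size, config.classifier_proj_size)


# Tiny random config (hypothetical values) so the sketch runs without real weights.
config = HubertConfig(
    hidden_size=32,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=64,
    conv_dim=(16, 16),
    conv_stride=(5, 2),
    conv_kernel=(10, 3),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=4,
    classifier_proj_size=8,
)
torch.manual_seed(0)
model = HubertWithUnusedFinalProj(config).eval()

wav = torch.randn(1, 1600)  # fake mono waveform, batch of 1

with torch.no_grad():
    out = model(wav, output_hidden_states=True)
    # hidden_states[0] is the encoder input, so index 9 is the 9th layer's output.
    layer9 = out.hidden_states[9]
    # What the legacy pipelines described above do: project the 9th layer anyway.
    legacy = model.final_proj(layer9)

print(layer9.shape)  # last dimension is hidden_size
print(legacy.shape)  # last dimension is classifier_proj_size
```

Using `layer9` directly corresponds to the recommended usage; `legacy` is only there to show what the backward-compatible path computes.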

Very very great discussion!

I converted the HuggingFace model to ONNX.
The size is very small (280MB)! And it works fine in my app, a realtime voice changer based on so-vits-svc.

Converter repo is here.