yxlllc/DDSP-SVC

Is there a pre-trained model for ContentVec encoder?

Closed this issue · 4 comments

Thanks for releasing the pretrained model for DDSP training, but it seems to be applicable only to the Hubertsoft encoder. I would like to ask whether there are any pre-trained models based on ContentVec (768layer12). If not, are there any plans to release such models in the future?

yxlllc commented

For the diffusion model part, pretrained models for both the hubertsoft and contentvec (768layer12) encoders are released, so there are two links there.
For the DDSP model part, because its training is relatively simple and the data requirements are much lower, I think it may not be necessary to release any pre-trained model.
I will also soon publish a "demo" model trained on the opencpop + kiritan dataset, based on the contentvec768l12 encoder and including both the diffusion model and the DDSP (CombSub) model. If you really need a DDSP (CombSub) pre-trained model using contentvec768l12, you can consider using it.
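For reference, here is a minimal sketch of pointing the training config at the contentvec768l12 encoder before using such a pre-trained CombSub model. The key names (`data.encoder`, `data.encoder_ckpt`), the config path, and the checkpoint path are assumptions based on the repo's example configs, not confirmed here; check them against your own `configs/combsub.yaml`.

```python
# Sketch only: adjust a DDSP-SVC style config so the content encoder matches
# the pre-trained CombSub model being loaded. Key names and paths below are
# assumptions based on the repo's example configs -- verify them locally.
import yaml

CONFIG_PATH = "configs/combsub.yaml"                                # hypothetical path
ENCODER_CKPT = "pretrain/contentvec/checkpoint_best_legacy_500.pt"  # hypothetical path

with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# The demo DDSP (CombSub) model is based on contentvec768l12, so the encoder
# used for preprocessing and training must match it.
cfg["data"]["encoder"] = "contentvec768l12"
cfg["data"]["encoder_ckpt"] = ENCODER_CKPT

with open(CONFIG_PATH, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```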

Thank you! @yxlllc
Regarding the DDSP model, does that mean that whether or not a pre-trained model is loaded will not significantly affect the convergence rate and model quality?

yxlllc commented

This mainly depends on the amount of data. If the dataset for a particular speaker is really small (for example, only dozens of audio clips), using a pre-trained model may improve quality, but the dataset can also be expanded by adding more speakers (multi-speaker training).

Unlike the diffusion model, training the DDSP model is fast enough that using a pre-trained model does not greatly reduce the number of iterations required (compared with multi-speaker training from scratch).
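To illustrate the trade-off, a generic warm-start sketch in plain PyTorch (not the repo's actual training code): matching weights are copied from a multi-speaker pre-trained checkpoint, while mismatched tensors (for example, a speaker-embedding table sized for a different number of speakers) keep their fresh initialization. The checkpoint layout and file name are hypothetical.

```python
# Generic warm-start sketch: initialize a single-speaker fine-tune from a
# multi-speaker pre-trained checkpoint. Checkpoint layout is hypothetical.
import torch

def warm_start(model: torch.nn.Module, ckpt_path: str) -> torch.nn.Module:
    """Copy weights whose names and shapes match; skip everything else."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    pretrained = ckpt.get("model", ckpt)   # assume a raw or wrapped state_dict
    own = model.state_dict()
    compatible = {
        k: v for k, v in pretrained.items()
        if k in own and own[k].shape == v.shape
    }
    own.update(compatible)
    model.load_state_dict(own)
    print(f"warm-started {len(compatible)}/{len(own)} tensors from {ckpt_path}")
    return model
```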

Got it, thanks!