jerryuhoo/VISinger

Samples?


@jerryuhoo Hello, how are you?

I'm wondering what the samples sound like. Are they as good as the original VISinger 1? I would love to take a listen.

Hi, the audio samples from this repository are not as good as the original VISinger 1, but I strongly recommend trying https://github.com/espnet/espnet/tree/master/egs2/opencpop/svs1. I implemented that one as well, and it gives better results than this repository.

Currently, the training time has not yet reached the length reported in the original paper, so the audio quality may be slightly lower than the paper's. However, the audio quality of VISinger 2 on ESPnet is nearly consistent with the original paper.

@jerryuhoo Thanks so much for your answers, I really appreciate it.

One more question from me, if you don't mind.

I have now developed my own SVS algorithm. It works great and is also a bit better than VISinger 2.

However, SVS needs a music score (a note duration sequence and a note pitch sequence) plus the lyrics. I already have a SOTA singing-to-text transcription model, so the lyrics are no worry for me. What I am having a hard time with is building a dataset, especially for English, so I have to use M4Singer, Opencpop, and OpenSinger, which are all in Chinese, and that limits the evaluation for listeners.

Having said that, do you have any idea how I can easily get the music score for English vocals, or automate it without the hard manual annotation that would take months? I have about 6 hours of English vocals, almost as much as Opencpop. Is there any tool or any easy new method for this?

Thanks in advance.

This is also a concern of mine at the moment. The durations could possibly be obtained with MFA (Montreal Forced Aligner). The pitch could be extracted from MIDI and then aligned based on tempo. Beyond that, I don't currently know of any other tools that can do this quickly.
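As a rough illustration of the tempo-based idea, here is a minimal sketch in plain Python. It assumes a constant tempo, notes already read from a MIDI file as (pitch, start beat, length in beats), and syllable intervals in seconds such as a forced aligner might produce. The function names, the data, and the use of pitch 0 for "no note" are all hypothetical, not part of MFA or any MIDI library.

```python
from typing import List, Tuple

Note = Tuple[int, float, float]      # (midi_pitch, start_beat, length_in_beats)
Interval = Tuple[str, float, float]  # (label, start_sec, end_sec)


def notes_to_seconds(notes: List[Note], bpm: float) -> List[Tuple[int, float, float]]:
    """Convert beat-based notes to (pitch, start_sec, end_sec) at a constant tempo."""
    sec_per_beat = 60.0 / bpm
    return [(p, s * sec_per_beat, (s + d) * sec_per_beat) for p, s, d in notes]


def assign_notes(intervals: List[Interval],
                 notes_sec: List[Tuple[int, float, float]]) -> List[Tuple[str, int, float]]:
    """For each aligned syllable, pick the note with the largest time overlap."""
    result = []
    for label, start, end in intervals:
        best_pitch, best_overlap = 0, 0.0  # assumption: pitch 0 means "no note"
        for pitch, n_start, n_end in notes_sec:
            overlap = min(end, n_end) - max(start, n_start)
            if overlap > best_overlap:
                best_pitch, best_overlap = pitch, overlap
        # The duration comes from the aligned audio, not from the MIDI note.
        result.append((label, best_pitch, round(end - start, 3)))
    return result


# Hypothetical data: three notes at 120 BPM and three aligned syllables.
notes = [(60, 0.0, 1.0), (62, 1.0, 1.0), (64, 2.0, 2.0)]
intervals = [("twin", 0.02, 0.48), ("kle", 0.50, 1.01), ("star", 1.03, 1.95)]
score = assign_notes(intervals, notes_to_seconds(notes, bpm=120.0))
print(score)
```

Real recordings rarely keep a perfectly constant tempo, so a result like this would still need checking by ear or some further correction.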

@jerryuhoo Hey, how is it going?

I wonder if you have seen these two projects: https://github.com/openvpi/SOME (for MIDI) and https://github.com/openvpi/SOFA. I wonder if they could help us with labeling. I haven't read up on them properly yet.

I think SOME converts the singing voice to MIDI fairly accurately, but I wonder whether, and how, it can be used to get the MIDI note sequence and the note duration sequence. If you find this interesting, I suggest looking into them to see what they can do. I'd like to hear your thoughts.

Thanks in advance.
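To make the question concrete, here is a minimal sketch of how transcribed notes could be turned into the two sequences an SVS model needs. It assumes the notes have already been read out of a MIDI file as (pitch, onset, offset) tuples in seconds. That input format, the `REST` value, and the `min_rest` threshold are assumptions for illustration, not the actual output format of SOME.

```python
from typing import List, Tuple

NoteEvent = Tuple[int, float, float]  # (midi_pitch, onset_sec, offset_sec)
REST = 0  # assumption: pitch 0 marks a rest / silence


def to_score_sequences(events: List[NoteEvent],
                       min_rest: float = 0.05) -> Tuple[List[int], List[float]]:
    """Return (pitch_seq, duration_seq), inserting rests for gaps >= min_rest."""
    pitches: List[int] = []
    durations: List[float] = []
    cursor = 0.0  # end time of the previous note
    for pitch, onset, offset in sorted(events, key=lambda e: e[1]):
        if onset - cursor >= min_rest:
            # A real gap: emit an explicit rest so the timeline stays continuous.
            pitches.append(REST)
            durations.append(round(onset - cursor, 3))
        else:
            onset = cursor  # absorb tiny gaps (or overlaps) into this note
        if offset <= onset:
            continue  # note is fully covered by the previous one
        pitches.append(pitch)
        durations.append(round(offset - onset, 3))
        cursor = offset
    return pitches, durations


# Hypothetical transcription result with a leading silence and a small gap.
events = [(64, 0.50, 1.00), (62, 1.02, 1.50), (60, 2.00, 3.00)]
pitch_seq, duration_seq = to_score_sequences(events)
print(pitch_seq)
print(duration_seq)
```

The two lists have the same length, and the durations add up to the end time of the last note, so they can be matched to the lyrics one note at a time.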