A Question about vocals
francqz31 opened this issue · 1 comments
Hello Mr. Author,
I have a couple of questions that I hope you can answer.
I want to train a 24 kHz to 48 kHz model on raw singing vocals that have no background music and no reverberation at all,
just pure raw studio vocals. I have a few questions:
1- These vocals are in different languages: approximately 50 hours of English, 90 hours of Chinese,
2.5 hours of Korean, 6.6 hours of Japanese and 1 hour of Italian. Would mixing languages confuse the model and make the output bad or hallucinated, or would it help overall? Or should I train only on the English vocals?
2- I should keep my almost 150-hour dataset on its own and not mix it with the MusDB dataset, which has musical mixtures and full songs with instrumentals, right? That is, just treat this 150-hour dataset like VCTK?
3- Some of these raw vocals aren't transcribed. Is this okay? Does aero accept and train on non-transcribed data?
4- I don't really understand FFT sizes, hop lengths and window sizes. Which of these settings will produce the best quality for the case I described above: (64/512), (256/1024), (256/512) or (128/512)?
Also, how do I find out the configuration (FFT size, hop length and window size) of my dataset, and how do I convert my dataset to whichever of the 4 configurations I choose or you recommend?
I don't mind the training time; I want the highest possible upscaling quality, so if there is another configuration that isn't mentioned above, please tell me about it.
Again, sorry if I'm asking basic questions, and thanks for your time.
Hey there,
- You can try to train on a more general dataset; it might even work better than VCTK. In general, I'm not assuming anything about the language, so there is a good chance it will work, though it might require some tweaking. If you encounter problems, I suggest that you first train without a discriminator, using only the Multi-Resolution STFT loss, to check that you are able to generate high frequencies. The output will probably have audible artifacts, but if you succeed in generating high frequencies, that's a first step.
- It might be more difficult to train on music and speech at the same time, as they are very different. But I don't know; maybe it's worth a try. Let me know what you find out.
- No need for transcription.
- Empirically, I found that 64/512 reaches the best quality, but it takes the most computational resources. So there is a trade-off.
- Just make sure your dataset is at the correct sample rates (e.g. low res = 4/8/16 kHz, high res = 16/24/48 kHz, or whatever you see fit), in accordance with the configuration files. Technically, you can use any FFT size, hop length and window size you want, and the code should handle it. But I can't guarantee that you won't need to tweak the code a bit for it to work.
- In general, the best configuration is a hop length of 64. For the high-resolution sample rates, though (e.g. 12->48 kHz), this takes a lot of computational resources. So it really depends on what GPUs you have within reach.
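
To make the two practical points above more concrete (check the sample rate of your files, and expect a smaller hop length to cost more), here is a rough standard-library Python sketch. It is not part of this repository. It assumes that the pairs in the question read as hop length / FFT size (as "hop-length of 64" suggests), that the files are plain PCM WAV, and that the STFT is centered and one-sided, a common convention that may differ from what the training code actually does. The function names are hypothetical.

```python
import wave

# (hop_length, n_fft) pairs from the thread, read as hop length / FFT size.
# This reading is an assumption based on the "hop-length of 64" remark.
CONFIGS = [(64, 512), (128, 512), (256, 512), (256, 1024)]


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def stft_shape(num_samples, hop_length, n_fft):
    """Return (frames, freq_bins) for a centered, one-sided STFT.

    Assumes the signal is padded so that frames are centered, which
    gives 1 + num_samples // hop_length frames and n_fft // 2 + 1 bins.
    """
    frames = 1 + num_samples // hop_length
    freq_bins = n_fft // 2 + 1
    return frames, freq_bins


def describe_configs(sample_rate, seconds=1.0):
    """Return one summary dict per config for a clip of the given length."""
    num_samples = int(sample_rate * seconds)
    rows = []
    for hop_length, n_fft in CONFIGS:
        frames, freq_bins = stft_shape(num_samples, hop_length, n_fft)
        rows.append({
            "hop_length": hop_length,
            "n_fft": n_fft,
            "frames": frames,
            "freq_bins": freq_bins,
            # Size of the time-frequency grid, a crude proxy for cost.
            "cells": frames * freq_bins,
        })
    return rows


if __name__ == "__main__":
    # Compare spectrogram sizes for one second of 48 kHz audio.
    for row in describe_configs(sample_rate=48000):
        print(row)
```

Under these assumptions, one second of 48 kHz audio gives about four times as many frames at hop length 64 as at hop length 256, which is one way to see where the extra compute goes. `wav_sample_rate` only reads the header, so it can be run over a whole dataset to find files that still need resampling before training; the hop length and FFT size are model settings, not properties of the audio files.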
Sorry for the late reply.
Let me know how it goes, and good luck!
M