v.0.0.6 XTTS Generation works as it should but VoiceCraft has several issues
Bookwald opened this issue · 7 comments
As of v.0.0.6, VoiceCraft produces audio that repeats lines and uses two voices. It also doesn't output all the sentences, only the first few. Generation time is very long: I've seen it eat up 24 GB of VRAM and an additional 70 GB of RAM only to output 37 seconds of audio after 7 minutes of generation time.
Sample: https://sndup.net/v4x5/
Yes, VoiceCraft is far from perfect. Its voice cloning is superior to XTTS, but it's not reliable - sometimes it doesn't generate the whole text it was given or drastically changes the pitch; basically, it hallucinates. I think it was made primarily with speech editing in mind, not TTS per se. Their TTS model is very new and they will hopefully publish a better one soon. It's usable for generating relatively short fragments and manually regenerating sentences until it produces something decent. I will expose more settings in the GUI; perhaps you can improve the results by playing with them. As for VRAM/RAM consumption and speed, I'm afraid I can't do anything about it until the authors release a new model or implement something like DeepSpeed. How many seconds of the voice sample were you using? Try playing with this - anything from 3 to 12 seconds; different lengths can work well for different voice samples.
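If it helps with trying different sample lengths, here is a minimal sketch for cutting a reference clip down to its first N seconds. It uses only Python's standard `wave` module and assumes the sample is an uncompressed PCM WAV file; the function name `trim_wav` and the file names are made up for illustration and are not part of this project.

```python
import wave


def trim_wav(src_path: str, dst_path: str, seconds: float) -> float:
    """Write the first `seconds` of a PCM WAV file to dst_path.

    Returns the duration actually written, which is shorter than
    `seconds` when the source clip itself is shorter.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Never ask for more frames than the source contains.
        n_frames = min(src.getnframes(), int(seconds * rate))
        frames = src.readframes(n_frames)

    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width and rate from the source;
        # the frame count in the header is corrected when the file closes.
        dst.setparams(params)
        dst.writeframes(frames)

    return n_frames / rate
```

For example, `trim_wav("reference.wav", "reference_6s.wav", 6)` would produce a 6-second clip, and repeating that for 3, 9 and 12 seconds gives a small set of samples to compare.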
I can see they uploaded new TTS models yesterday, but I can't find the actual files. Will add them when I do. I should be able to expose the additional API settings in the GUI today. But I think for longer generations I'd recommend using XTTS with RVC for best voice cloning results.
I added the other parameters to the GUI under "Advanced settings". Please try disabling the cache (set the value from 1 to 0, it drastically reduced VRAM usage for me) and play a little with the others (try setting "stop repetition" to 2, for example, or possibly "sample batch size" to 2). I will update the API to use the newer models tomorrow and enable model selection in the GUI.
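To keep track of which tweak actually helps, the suggestions above can be written down as a short list of settings to try one at a time. This is only a sketch: the key names (`cache`, `stop_repetition`, `sample_batch_size`) are hypothetical stand-ins for the GUI labels, not confirmed API parameter names, and the only default taken from this thread is cache = 1.

```python
# Hypothetical setting names that mirror the GUI labels in this thread;
# the real API may spell them differently.
BASELINE = {"cache": 1}  # cache defaults to 1 according to the comment above


def candidate_settings(baseline):
    """Yield (label, settings) pairs to try, one change at a time.

    Every candidate disables the cache first, since that is the tweak
    reported to reduce VRAM usage; the others are layered on top.
    """
    no_cache = {**baseline, "cache": 0}
    yield "cache off", no_cache
    yield "cache off + stop repetition 2", {**no_cache, "stop_repetition": 2}
    yield "cache off + sample batch size 2", {**no_cache, "sample_batch_size": 2}


for label, settings in candidate_settings(BASELINE):
    print(label, settings)
```

Changing one value per run makes it easier to tell which setting is responsible for any improvement in VRAM usage or repetition.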
Thanks for working on this. I'm looking forward to trying out the newer models. I'll test disabled cache and repetition.
I've added model selection to the GUI. Please try both the 330M model and the larger 830M model. Also, the cache is now cleared after every generation, which was an update to the original code that I overlooked. Perhaps this will solve the issue with excessive VRAM usage without disabling cache.
With VoiceCraft I still had to disable the cache to reduce VRAM usage. I'm getting a repeat of the first line of my reference audio at the beginning of each sentence of my input text. However, I'm finding that XTTS works excellently with RVC on top. XTTS captures the way the person speaks and RVC gives the texture of the voice back.
I tested it today on a rented VM with a 3090 and it used about 12 GB of VRAM (with cache on). There were no instances of reference audio in the generations and generally the quality was... decent, though I still think it's easier to generate long texts with XTTS (fewer regenerations are needed). Have you updated / reinstalled the VoiceCraft API? Anyway, I'm glad that XTTS + RVC works well for you :)
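For anyone who wants to script the "generate sentence by sentence and regenerate the bad ones" workflow mentioned in this thread, here is a rough sketch. It is not how this project does it: `synthesize` and `is_acceptable` are hypothetical callables you would supply yourself (for example, a wrapper around whichever TTS backend you use and a manual or automatic quality check), and the sentence splitter is deliberately naive.

```python
import re
from typing import Any, Callable, List, Optional, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def generate_with_retries(
    text: str,
    synthesize: Callable[[str], Any],           # hypothetical: sentence -> audio
    is_acceptable: Callable[[str, Any], bool],  # hypothetical quality check
    max_attempts: int = 3,
) -> List[Tuple[str, Optional[Any]]]:
    """Generate each sentence separately, retrying up to max_attempts times.

    Returns (sentence, audio) pairs; audio is None when every attempt
    for that sentence was rejected.
    """
    results = []
    for sentence in split_sentences(text):
        audio = None
        for _ in range(max_attempts):
            candidate = synthesize(sentence)
            if is_acceptable(sentence, candidate):
                audio = candidate
                break
        results.append((sentence, audio))
    return results
```

Sentences that come back with `None` are the ones that would still need manual attention, which gives a rough measure of how many regenerations a given model or voice sample requires.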
PS. I generated this today using VoiceCraft (no regenerated sentences, took about 9m on a 3090, I don't remember which model it was): http://sndup.net/p4q9