kanttouchthis/text_generation_webui_xtts

Feature Suggestion - Toggle GPU/CPU generation.

erew123 opened this issue · 2 comments

As the title says, a toggle switch to change between CPU and GPU generation, i.e. swapping between these instances:

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
to/from
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")

within the interface. I'm guessing it would need to reload the TTS engine each time you do this, and I'm not sure how feasible that is. I'm only guessing it may need a reload because I notice that when you set it to "cpu" it pre-caches the model in system RAM. I put a bit of an explanation in the ticket here: #5 (comment)
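Just to illustrate the idea, a minimal sketch of what such a toggle could look like (the set_tts_device helper and the module-level globals are my own assumptions, not part of the extension):

from TTS.api import TTS

tts = None
current_device = None

def set_tts_device(device):
    # Load the XTTS model once, then move it to the requested device ("cuda" or "cpu").
    global tts, current_device
    if tts is None:
        tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    if device != current_device:
        tts.to(device)  # .to() moves the underlying torch model between devices
        current_device = device
    return tts

Whether moving the loaded model back and forth is enough, or whether a full reload is needed each time, is exactly the open question above.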

This would potentially be better for people who are low on VRAM.

In short, if you are short on VRAM because your AI model fills it up, it takes a long time to generate audio, and in some cases, depending on your CPU, it may actually be faster to run audio generation on the CPU with the TTS model loaded into system RAM. I'm not saying it's as fast as a GPU, but on a card with little free VRAM it can be faster than using the GPU. Here are a few speed tests, each generating audio for roughly the same amount of text/tokens.

I'd also caveat that text full of punctuation (commas, exclamation marks, etc.) seems to slow down processing, so the tests below aren't 100% comparable since different text was generated on each run. The number of tokens generated and the audio processing times are shown for each case.

RTX 4070 12GB with a 13B model loaded, using about 11.7GB of VRAM with only 300MB free (processing times are in seconds):
Output generated in 6.71 seconds (19.66 tokens/s, 132 tokens, context 74, seed 1793814269)
Processing time: 169.98434138298035

Output generated in 5.20 seconds (21.53 tokens/s, 112 tokens, context 74, seed 2068028316)
Processing time: 270.1332013607025

Output generated in 4.37 seconds (20.37 tokens/s, 89 tokens, context 74, seed 670928270)
Processing time: 90.19319772720337

Output generated in 5.02 seconds (20.90 tokens/s, 105 tokens, context 74, seed 2069843369)
Processing time: 87.67613339424133

CPU & RAM only, no GPU at all - a slight speed increase compared to the GPU runs above, where only 300MB of VRAM was free:
Output generated in 4.17 seconds (21.83 tokens/s, 91 tokens, context 74, seed 811502282)
Processing time: 76.13149285316467

Output generated in 4.60 seconds (24.99 tokens/s, 115 tokens, context 74, seed 52612104)
Processing time: 106.75732398033142

And just to show that if you have plenty of VRAM spare, it's fast:

RTX 4070 12GB with a 7B model loaded, using about 8.5GB of VRAM with 3.5GB free:
Output generated in 2.13 seconds (24.46 tokens/s, 52 tokens, context 74, seed 568523180)
Processing time: 9.31659722328186

Output generated in 2.28 seconds (25.84 tokens/s, 59 tokens, context 74, seed 986042543)
Processing time: 13.92505669593811

Output generated in 2.18 seconds (25.67 tokens/s, 56 tokens, context 74, seed 671001220)
Processing time: 17.42378520965576
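Given the numbers above, a toggle could even be automated by checking free VRAM before loading the TTS model. A rough sketch (the 2GB threshold and the helper name are arbitrary assumptions on my part):

import torch

def pick_tts_device(min_free_bytes=2 * 1024**3):
    # torch.cuda.mem_get_info() returns (free, total) GPU memory in bytes.
    if torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
        if free_bytes >= min_free_bytes:
            return "cuda"
    return "cpu"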

For now I added a config option for the device, so you can set it to cpu in there. Alternatively, you can try setting "cpu_offload" to true and "device" to "cuda", which will transfer the model to the GPU while it's running and back to the CPU when it's not. This should help with low VRAM during text generation, but the TTS might still be slow.
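For example, the relevant settings might look something like this (key names are taken from the comment above; the exact file name and format are assumptions):

{
  "device": "cpu",
  "cpu_offload": false
}

or, to keep the model on the GPU only while it is actually generating:

{
  "device": "cuda",
  "cpu_offload": true
}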

I'll close the ticket off! Thanks! :)