Low performance issue/question
jepjoo opened this issue · 14 comments
I'm seeing an example of 29s of audio rendered in ~3s, so about a 10:1 ratio on a 4090 here:
https://github.com/RandomInternetPreson/text_generation_webui_xtt_Alts/tree/main#example
But on my 4090 (+Ryzen 7600X) Win 11 system I'm seeing more like a 3:1 ratio.
- GPU usage is at 25-40% during audio rendering, drawing ~70W.
- Lots of VRAM free
- CPU usage isn't super heavy during audio rendering
- Textgen is up to date, and the CUDA version is 12.1, if that matters
Any ideas what's bottlenecking me? And anyone else seeing worse than expected performance?
Someone mentioned that the generation speed depends on the sampling rate and number of channels in the reference audio. Try resampling your audio to 24000 Hz mono and see if that changes anything. The model samples at 24 kHz mono anyway, so there shouldn't be any difference in quality.
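If you want to try that, a minimal resampling sketch using torchaudio (the paths are placeholders, and this assumes torchaudio is already installed alongside TTS):
import torchaudio

# Load the reference clip, downmix to mono, and resample to the model's 24 kHz rate
waveform, sample_rate = torchaudio.load("<path-to-ref-wav>")
mono = waveform.mean(dim=0, keepdim=True)  # average the channels down to mono
resampled = torchaudio.functional.resample(mono, orig_freq=sample_rate, new_freq=24000)
torchaudio.save("<path-to-ref-24k-mono.wav>", resampled, 24000)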
Should have mentioned, I mostly tested with the included example.wav, which seems to be 22 kHz mono. Poor performance with that too.
@kanttouchthis you're asking the TTS to do a lot of extra work it doesn't need to do every time by making the call via tts.tts_to_file().
Here's a short-hand reference implementation of what I have locally, which usually runs at 1-2 seconds per 20 seconds of audio on a 3090:
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import numpy
import nltk

# Load the XTTSv2 config and tweak the generation parameters
config = XttsConfig()
config.load_json("<path to xtts2 config>")
config.temperature = 0.65
config.decoder_sampler = 'dpm++2m'
config.cond_free_k = 7
config.decoder_iterations = 256
config.num_gpt_outputs = 512

# Initialize the model, load the checkpoint, and move it to the GPU
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="<path-to-model-folder>", use_deepspeed=True)  # deepspeed isn't required
model.cuda()

# Compute the conditioning latents and speaker embedding once from the reference wav
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path="<path-to-ref-wav>")
# make the latents and embeddings lists (or dicts) and load them in multiple times for different characters
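One way to read that last comment is a dict of latents and embeddings keyed by character name, computed once at startup. A minimal sketch (which would replace the single gpt_cond_latent / speaker_embedding above), assuming hypothetical character names and reference wav paths:
character_refs = {
    "alice": "<path-to-alice-ref-wav>",  # hypothetical characters/paths - adjust to your setup
    "bob": "<path-to-bob-ref-wav>",
}
gpt_cond_latent = {}
speaker_embedding = {}
for name, ref_wav in character_refs.items():
    # Compute the conditioning latents once per character and cache them for reuse
    gpt_cond_latent[name], speaker_embedding[name] = model.get_conditioning_latents(audio_path=ref_wav)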
Run all that stuff outside of the usual loop in chat - that's init level stuff and should just be run once and stashed. Then on actual call:
def run_voice(chat_interface_text, character):
    global voice  # my local object that plays audio - but you can keep the save/load to file
    out = []
    # Split the reply text into sentences and strip the end-of-sequence token
    sentences = nltk.sent_tokenize(chat_interface_text.replace("</s>", ""))
    # 0.35 s of silence at the model's 24 kHz output rate, used as padding between sentences
    silence = numpy.zeros(int(0.35 * 24000))
    for sentence in sentences:
        out.append(
            model.inference(
                sentence,
                "en",
                gpt_cond_latent[character],   # example passing in different latents and embeds for different chars
                speaker_embedding[character],
                temperature=0.7,  # Add custom parameters here
            )
        )
    # Stitch the per-sentence clips together with the silence padding
    stitched = numpy.concatenate([numpy.append(i['wav'], silence) for i in out])
    # Convert float audio in [-1, 1] to 16-bit PCM
    voice.object = numpy.int16(numpy.array(stitched, dtype=numpy.float32) * 32767)
    # to write the numpy obj to disk instead, run:
    # torchaudio.save('file.wav', torch.tensor(voice.object).unsqueeze(0), 24000)
    return
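For completeness, a hypothetical call site (the reply text and the "alice" key are placeholders, not part of the original code):
latest_reply = "Hello there. How can I help you today?"  # placeholder chat reply
run_voice(latest_reply, "alice")  # "alice" must be a key in gpt_cond_latent / speaker_embedding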
Two things that might speed up your inferencing and voice outputs:
- If you're on Windows, enable Hardware-accelerated GPU scheduling. It's a setting under "Graphics settings"; just turn it on and restart your computer.
- Let your computer boot all the way and log in, then restart once more, enter your system BIOS, and enable "Resizable BAR". This also helps reduce latency.
Thanks for the tips!
Turning off HW-accelerated GPU scheduling lowered the "Real-time factor" from about 0.37-0.4 to 0.28-0.3. That's a pretty decent boost. I also observed an increase in GPU usage during audio rendering.
I have another observation, though I'm going to open another ticket about it as a feature request. I'll try to keep my explanation simple here, though.
I have a 12 GB card, and loading a 13B model on that card uses 11.7 GB of VRAM, so only 300 MB of VRAM is left.
My AI text generation is nice and fast at 20 tokens a second. However, when it goes to process the audio, it's clearly swapping the TTS model into the graphics VRAM, perhaps in chunks, as you don't see any major memory changes. So processing, say, four lines of text with this setup can take 60 seconds.
If I load a 7B model, which only takes about 8.5 GB of my VRAM, I now have 3.5 GB of VRAM free, so the TTS model can easily load into VRAM without issue, in one nice lump. Generating the audio output now drops to between 9 and 20 seconds, which is fantastic... though I'm now using a less powerful model!
I tried editing the script.py and changing the references to "cuda" to "cpu", which loads the TTS into your system RAM and processes it on your CPU, not your GPU. In my case I have an 8-core, 16-thread CPU.
Is CPU rendering faster when I'm using a 13B model and short on VRAM? Yes, just about, I think... processing on my CPU in that situation may just be a bit faster, perhaps 10-15%. Obviously it's NOT faster than processing when I am using a 7B model and my GPU with 3.5 GB of VRAM spare.
I'm thinking you need about 1.5GB of VRAM to fit the TTS in, maybe closer to 2GB to do it comfortably.
So, it may be faster to use your CPU in some instances, depending on how much VRAM you have left after you have loaded your model and depending on how fast your CPU is.
At this point in time, you can try on your own system and experiment, but I guess what I'm saying is: if you don't have much VRAM left on your card after loading your model, expect slower processing times for audio.
If you edit the text-generation-webui\extensions\text_generation_webui_xtts\script.py file to change the three instances of "cuda" to "cpu" in there, you DO have to reload Text-Gen-WebUI (unloading and reloading on the Session tab may work; I haven't tried it).
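If you'd rather not hand-edit the device each time, a rough sketch of auto-selecting it based on free VRAM (the ~2 GB threshold is just the estimate above, not a measured requirement, and pick_tts_device is a made-up helper name):
import torch

def pick_tts_device(min_free_bytes=2 * 1024**3):
    # Fall back to the CPU when the GPU has less than ~2 GB free after the LLM is loaded
    if torch.cuda.is_available():
        free_bytes, _total_bytes = torch.cuda.mem_get_info()
        if free_bytes >= min_free_bytes:
            return "cuda"
    return "cpu"

device = pick_tts_device()
model.to(device)  # instead of hard-coding "cuda" or "cpu" in script.py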
Me neither; I don't get the performance reported by RandomInternetPreson. But when I use the realtimeTTS version installed in the same environment, which uses the Coqui engine, I can generate a few sentences in one second. The realtimeTTS version doesn't record any file, though, and isn't an extension integrated into ooba.
It proves, though, that my setup can do it, so I don't really understand what is happening.
I only have a 4070 Ti, but I did these tests with a 7B Q4 model and have 3.5 GB of VRAM left when the XTTS model and the LLM are loaded.
It takes me 20 seconds to generate 30 seconds of audio on ooba with XTTS.
I can't replicate the results from the reference implementation above. For my sample text, the extension took 11.7 seconds. That code took 10.9, but that was with the custom generation parameters; without those, it also took 11.7 seconds. The speedup likely comes from the fact that it uses deepspeed.
I too have reworked your code to accommodate the suggestion, with no speed increase. However, I think I know the reason.
https://tts.readthedocs.io/en/latest/models/xtts.html
Check out this link: you need deepspeed enabled. I haven't done this yet, but I will today. I think I need to enable deepspeed in oobabooga to get it working. Look at ooba's repo front page; they have instructions on how to enable deepspeed.
https://github.com/oobabooga/text-generation-webui#deepspeed
Unfortunately, deepspeed isn't officially supported on Windows. You could probably get it running on WSL, though.
This is also on my list to try out. I tried doing it last night but was getting errors with a WSL install. The deepspeed documentation says that it should work in WSL. I wasn't loading with --deepspeed, though.
Also, according to the deepspeed documentation, it does work on Windows, with the caveat that it only works for inference. That makes me think it might work on Windows with ooba, since prebuilt Windows wheels come installed.
I'm able to run deepspeed on Windows with Python 3.9; I failed with Python 3.10/3.11.
I used this file to install it
https://huggingface.co/Jmica/audiobook_maker/tree/main
pydantic has to be under 2.0 or you will get errors.
@kanttouchthis yep, most of the big speed difference is from deepspeed; the other smaller chunk is likely the re-compute of the latents and embeddings when doing the 'clone' each time, but that's not a huge task.
My general experience has been that deepspeed can be a hassle to compile/run on a given env - it's certainly fantastic when you do get it working, but for those less used to dev work, it might be a slog if there isn't already just the right build packaged for them.
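One way to soften that for people who can't get deepspeed built is to make it optional at load time; a small sketch, not tied to any particular extension:
import importlib.util

# Only ask for deepspeed if the package is actually importable in this environment
has_deepspeed = importlib.util.find_spec("deepspeed") is not None
model.load_checkpoint(config, checkpoint_dir="<path-to-model-folder>", use_deepspeed=has_deepspeed)
model.cuda()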