nerdaxic/glados-voice-assistant

Local TTS Engine Requirements

Opened this issue · 19 comments

Are there undocumented requirements for using the new local glados-tts? I have tried cloning it directly and running it against this project on a clean Ubuntu install, walking through every listed dependency, without success. Right now, on a clean install with everything listed installed, running it gives:

Initializing TTS Engine...
Traceback (most recent call last):
  File "glados.py", line 22, in <module>
    glados = torch.jit.load('models/glados.pt')
  File "/usr/local/lib/python3.8/dist-packages/torch/jit/_serialization.py", line 161, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: Unknown qengine

What are your machine specs? The PyTorch model being used requires a CPU that supports AVX2 instructions. Also, are you running it in a VM?

I initially ran it on Windows via the Linux subsystem, but that's a whole can of worms when dealing with audio, so I abandoned it pretty quickly. I then tried natively on Windows, but that complains about espeak not being installed; even after installing it, it still fails with the same error. I used the Ubuntu subsystem to inspect /proc/cpuinfo, so it appears my Windows system does support AVX2, but since Windows itself seems to be the issue, it won't work there. The error above is from a different dedicated piece of hardware (not a VM) that does not report avx or avx2 in its cpuinfo. So that may be the requirement I'm missing: hardware that supports AVX2.

From what I read, the Pi won't support this, correct?

Yes, sadly at this point the Pi doesn't support the TTS engine. Someone had given us very vague instructions on how to 'maybe' get it to work, but it was way above my head and didn't look all that promising.

I have managed to get the tts-engine to work on my Windows machine. There is something I had to do to get espeak to work, something about adding its library to the Windows environment variables.

When it comes to audio, I also got that working on Windows for the engine. I haven't updated the main glados assistant to include the same code, as I only use Windows for the TTS and my Pi for everything else.

Do you know what variable you added to windows? Also, do you have a link to the comment concerning pi compatibility?

Just did some googling and testing so if anybody else comes here looking for the answer...

You'll need to use espeak-ng, https://github.com/espeak-ng/espeak-ng/releases.

And then referencing this issue bootphon/phonemizer#44,

Add the env variable: PHONEMIZER_ESPEAK_LIBRARY
Set the value to point to the .dll file.

So assuming you didn't change the default install location of eSpeak NG, you'd set:
PHONEMIZER_ESPEAK_LIBRARY="c:\Program Files\eSpeak NG\libespeak-ng.dll"
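
If you'd rather not touch the Windows environment variables, something like this before the TTS engine initializes phonemizer should work too (just a sketch; the path assumes the default eSpeak NG install location):

import os

# Point phonemizer at the espeak-ng shared library before the TTS engine
# initializes phonemizer; once the backend has already tried to find espeak,
# setting this is too late.
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"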

Yea good job. That's what I had to do. Thanks for documenting it here.

@SuperJonotron @eternalliving Sorry for the late reply. I'm the person who made this TTS. Both the TTS and the vocoder were optimized with TorchScript. The TTS specifically always runs on the CPU because it is so fast that it does not need GPU speedup. The vocoder is much more costly; with a more efficient vocoder the end-to-end latency would be extremely low. In the models folder there are also high- and low-quality CPU models which remove the requirement for a GPU. All of the models (excluding the GPU vocoder) have been quantized for CPU inference. I believe that internally they use XNNPACK, which has SSE, AVX, AVX2, AVX512, and even NEON implementations, so it should even be able to run on a Raspberry Pi (extremely slowly).
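
Loading the CPU variants is just a couple of torch.jit.load calls, roughly along these lines (which vocoder file you pick depends on the quality you want):

import torch

# Both networks are TorchScript archives, so no model class definitions are needed.
glados = torch.jit.load('models/glados.pt')                              # quantized TTS, always CPU
vocoder = torch.jit.load('models/vocoder-cpu-hq.pt', map_location='cpu') # high-quality CPU vocoder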

@R2D2FISH
After testing all of this out on Windows, I used Docker under WSL to develop and test this on a Windows machine I knew had the correct AVX2 capabilities, and it works perfectly (https://github.com/SuperJonotron/glados-tts). I just tried to run that on a Pi 4 and get this response:

INFO: Initializing TTS Engine...
Traceback (most recent call last):
  File "engine.py", line 22, in <module>
    glados = torch.jit.load('models/glados.pt')
  File "/usr/local/lib/python3.8/dist-packages/torch/jit/_serialization.py", line 161, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: Unknown qengine

Seems to be the error associated with missing AVX2, but if there's something else that needs to happen for this to run on a Pi, let me know. I don't think it's supported because of that requirement, but I'd be happy to be proven wrong.

@SuperJonotron Not sure if this works, but try running this command before loading the models: torch.backends.quantized.engine = 'qnnpack'
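
In engine.py that would sit right above the jit.load call, roughly:

import torch

# Select the ARM quantized backend before any quantized TorchScript module is
# deserialized; without this, jit.load fails with "RuntimeError: Unknown qengine".
torch.backends.quantized.engine = 'qnnpack'

glados = torch.jit.load('models/glados.pt')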

@R2D2FISH I had seen that comment somewhere but was unclear on where it was intended to be used, so thanks for the clarification. Tested it out and it seems to work. Here are some benchmark comparisons for anybody else interested in performance before going this route.

More powerful computer with AVX2 support:
Without qnnpack set:
Startup: ~10 seconds
Generation: 2199 ms for test phrase "Oh, it's you"

With qnnpack set:
Startup: ~60 seconds
Generation: 27932 ms for test phrase "Oh, it's you"

Approximately 12.7x slower when qnnpack is forced on a system with AVX2 support.

On RPi4 Model B Rev 1.4, 8GB RAM:
With qnnpack set:
Startup: ~25 seconds
Generation: 8631 ms for test phrase "Oh, it's you"

Approximately 4x slower than the AVX2 option on the more powerful system.

So the RPi with qnnpack runs faster than the more powerful AVX2 system forced onto qnnpack, but it's still slower than that system's default backend. Using the cache, though, still returns instantly, so this should let you generate a library of phrases locally and still have instant responses if you use the cache option.

@SuperJonotron Awesome! On AVX2 systems you might want to test replacing 'qnnpack' with 'xnnpack', 'fbgemm', or 'mkldnn'. I'll modify the code to autoselect a backend based on the host device. I suspect that mkldnn will be the fastest, but I can't test it myself at the moment because my PyTorch build is missing mkldnn (it was causing build errors).
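
The autoselection I have in mind is only a few lines, something like this (assuming torch.backends.quantized.supported_engines reports correctly for the local build):

import torch

# Prefer the x86 engine (fbgemm) when the build supports it, otherwise fall
# back to qnnpack for ARM devices like the Pi.
for engine in ('fbgemm', 'qnnpack'):
    if engine in torch.backends.quantized.supported_engines:
        torch.backends.quantized.engine = engine
        break

print(f"Using quantized engine: {torch.backends.quantized.engine}")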

@R2D2FISH Tested out the other options on the AVX2 hardware; here are the performance results.
xnnpack and mkldnn are both unsupported, though I don't know enough about this framework to know why or whether that's expected:
RuntimeError: xnnpack is not a valid value for quantized engine
RuntimeError: mkldnn is not a valid value for quantized engine

Running with fbgemm (same message 4 times):
2264 ms to generate
2182 ms to generate
2065 ms to generate
2144 ms to generate

Running with nothing specified (same message 4 times):
2202 ms to generate
2075 ms to generate
2141 ms to generate
2111 ms to generate

Looks to me like an AVX2 system already chooses the most optimized setting, since these times are basically the same. Hopefully that means an update to auto-select the correct backend on other hardware is all that's needed on your end, since AVX2 systems already pick a good one.

@jhughesbiot Good to know.

@SuperJonotron A few things to note. First of all, the reason the startup is so long is that, in order to load the models into RAM, they made it run something like four empty strings in a row when it first loads. You may want to experiment with altering that number or removing it. Additionally, they discovered that my "quantized" models are actually slower than the standard version and switched to that one. This may not be the case on aarch64, so you might want to try the 'vocoder-cpu-hq.pt' model and see if it performs any better. Finally, and perhaps most excitingly, the Raspberry Pi 4 apparently has full Vulkan 1.2 conformance now, so you may actually be able to run these models on the GPU with the right drivers and a custom build (I don't think prebuilt PyTorch has Vulkan enabled).
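
As a rough sketch of those first two experiments (the warm-up count here is just a placeholder name; the real loop and generation call live in engine.py):

import torch

torch.backends.quantized.engine = 'qnnpack'            # Raspberry Pi / ARM
glados = torch.jit.load('models/glados.pt')
vocoder = torch.jit.load('models/vocoder-cpu-hq.pt')   # try the hq CPU model in place of the GPU one

# Fewer warm-up passes mean a faster startup but a slower first real request.
WARMUP_RUNS = 1   # engine.py currently does roughly four empty-string runs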

I don't really have any experience with torch, Vulkan, or the various models created for this project, so I'm not really sure where I'd start on getting the RPi onto a custom build, but I'd be happy to test something out.

Looks like using vocoder-cpu-hq.pt instead of the GPU model on the RPi4 drops the time by a little more than half. I'm seeing about 1.7x slower (vs. ~4x with the GPU model) than the AVX2 option, which isn't that bad if you're not expecting real-time responses anyway.

@SuperJonotron That's actually amazingly quick for a Pi. I'll try to cook up a Vulkan-enabled aarch64 build on my laptop later today if you'd like to try it out. I'll throw in ThinLTO for good measure ;). I'm pretty sure the reason the quantized models are not running very fast on desktop is actually that qnnpack and xnnpack are disabled in the desktop builds (at least on Windows). I trained this model on an extremely cobbled-together version of PyTorch and ROCm, both built from source so I could train on my AMD GPU laptop, so a lot of options were left turned on, which is probably why I was seeing better performance with the quantized builds. Are you running Raspberry Pi OS?

@R2D2FISH For both systems, I am using my fork that wraps this project in Docker. The base image for that container is Ubuntu 20.04. On my Windows machine with AVX2 I run this via the Linux subsystem (WSL) and Docker Desktop. The RPi is running Ubuntu 20.04 Desktop, but since I run both tests in Docker, there's really no difference in what the host is running: both execute in the same Ubuntu 20.04 container and just use the hardware available.

I'll definitely try out anything that might improve speed.

I managed to get glados-tts to run on my system that doesn't support AVX2 by rebuilding PyTorch. You can find my blog post about it here: https://blog.longearsfor.life/blog/2023/11/26/building-pytorch-for-systems-without-avx2-instructions/ I hope this helps anyone running into similar problems.