MycroftAI/mycroft-core

Allow using the Whisper speech recognition model

12people opened this issue · 12 comments

Is your feature request related to a problem? Please describe.
While Mimic has been continually improving, OpenAI just released their Whisper speech recognition model under the MIT license, which seems to be superior yet still usable offline.

Describe the solution you'd like
It'd be great if Mycroft could either replace Mimic with Whisper or offer Whisper as an option.

Hi Mirek,

Thanks for starting this thread - Whisper is looking pretty interesting, certainly something that's come up in our own chats.

A couple of clarifications though. Mimic is our Text-to-Speech engine. It synthesizes spoken audio from some input text so that Mycroft can speak. Whisper as you've noted is for speech recognition or speech-to-text. That allows Mycroft to hear what the user is saying.

In terms of running offline you would need some decent hardware for this. I don't believe for example that it would be possible on the Mark II, which has a Raspberry Pi 4 inside. The max RAM you can assign to the GPU on the Pi 4 is 256MB. The smallest ("tiny") Whisper model requires 1GB VRAM. So yeah, unlikely to run on a Pi at all, but I'd be very interested if someone managed it.

More broadly, I haven't seen any detail on what the training data for Whisper was. I'm assuming they're going the Microsoft / GitHub Copilot route of saying it doesn't matter and having a big team of lawyers ready to defend that. As a company we certainly don't have any position on this yet.

Whoops, you're right, I thought Mimic was an STT engine instead — my bad. But it sounds like I was understood nevertheless. :)

You're right that the Raspberry Pi certainly falls below the system requirements here. It'd be nice to see this as an option on desktop Linux, though, where most modern systems can meet the requirements for at least the smallest model.

I have Mycroft on my robot. I'm using an RPi 4 with mostly standalone TTS and STT programs: DeepSpeech and Mimic 3. I would like to tackle using OpenAI Whisper as my STT engine. I understand that it may not be practical for the Mycroft authors to implement Whisper, but I would like to try to do that myself. Is there anyone out there who could give me guidance with that? It seems like all the interfaces and APIs are there now, judging by the number of different apps available.

You could have a look at the docs for creating STT-plugins: https://mycroft-ai.gitbook.io/docs/mycroft-technologies/mycroft-core/plugins/stt

And also refer to the STT classes included in mycroft-core, MycroftSTT for example.
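
To make that a bit more concrete, here's a very rough, untested sketch of what a Whisper-backed STT plugin could look like. It assumes the openai-whisper package is installed, and the class and config names are just illustrative; the actual packaging and entry point registration is what the plugin docs above describe:

```python
# Sketch of a Whisper-backed STT plugin for mycroft-core (illustrative, not
# an official implementation). Assumes `pip install openai-whisper`.
import numpy as np
import whisper

from mycroft.stt import STT


class WhisperSTT(STT):
    def __init__(self):
        super().__init__()
        # Model size could come from the plugin's config section; "tiny"
        # keeps the memory footprint as small as possible.
        self.model = whisper.load_model(self.config.get("model", "tiny"))

    def execute(self, audio, language=None):
        # `audio` is a speech_recognition.AudioData instance; convert it to
        # the 16 kHz float32 mono samples that Whisper expects.
        raw = audio.get_raw_data(convert_rate=16000, convert_width=2)
        samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
        # Mycroft uses language codes like "en-us"; Whisper wants just "en".
        lang = (language or self.lang or "en").split("-")[0]
        result = self.model.transcribe(samples, language=lang)
        return result["text"].strip()
```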

Thanks. That will get me started.

Maybe it can use Whisper through an inference API, which can be hosted by HuggingFace, Mycroft, or on your local network.
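
For the local-network option, a minimal self-hosted endpoint could be as simple as the sketch below (assuming the openai-whisper, fastapi, uvicorn and python-multipart packages; the route and field names are just placeholders). An STT plugin would then POST the recorded WAV to it and read back the text:

```python
# Hypothetical local-network Whisper inference server (a sketch, not an
# official Mycroft or HuggingFace service).
import tempfile

import whisper
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
model = whisper.load_model("base")  # pick a size your CPU/GPU can handle


@app.post("/stt")
async def stt(file: UploadFile = File(...)):
    # Write the uploaded WAV to a temporary file and let Whisper decode it.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        result = model.transcribe(tmp.name)
    return {"text": result["text"]}

# Run with: uvicorn whisper_server:app --host 0.0.0.0 --port 8000
```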

Something like whispering perhaps

Do we need real-time? All we have to do is listen for a wake word, record the audio, stop when there's no more voice activity, send the audio to the server, and receive the transcription. I don't quite see how it would be useful...

Not "real-time" just means it will take longer to return the result. In a really bad scenario you would wake the device, speak your question/command then wait a few minutes for the transcription to come back before Mycroft can act upon it. Honestly you really want (at the very least) less than 2 seconds response time for STT or it just feels too slow. It quickly hits the point where you may as well whip out your phone and open an app or type a search query.

Self-hosting is great, as long as you have a decent GPU on your local network that is always running (at least running while your voice assistants are) which can noticeably add to your power bills.

Someone might publish a plugin that uses a publicly available API, however you would need to trust that API provider with your data and check the terms of service. If someone in the community creates a plugin that violates a site's terms of service then it's up to each person whether they use it, but it's not necessarily something we can legally distribute as a company.

In terms of an official Mycroft hosted instance - it might be something we choose to host in the future, but it's not something we're working on right at this moment. We'd rather get better on-device STT. Something that can run in real-time on the Pi, and that has a high enough accuracy for the range of vocabulary that people expect a voice assistant to understand. Can't promise anything yet, but we'll see what happens...

The whisper-large-v2 model is available on HuggingFace, and they support a hosted API. I've not used it yet personally, but they appear to support streaming inference. Might be worth exploring!
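
For anyone who wants to try it, hitting the hosted Inference API only takes a few lines. This is a sketch based on the HuggingFace docs at the time of writing; you need an access token, and you should check the current rate limits and terms before building on it:

```python
# Sketch: send a WAV file to the HuggingFace Inference API for
# openai/whisper-large-v2 and read back the transcription.
import requests

API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v2"
HF_TOKEN = "hf_..."  # your HuggingFace access token


def transcribe(wav_path):
    with open(wav_path, "rb") as f:
        audio_bytes = f.read()
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        data=audio_bytes,
    )
    response.raise_for_status()
    # The ASR pipeline returns JSON like {"text": "..."}
    return response.json()["text"]


print(transcribe("utterance.wav"))
```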

There is also whisper.cpp, which uses the same models and runs only on the CPU: https://github.com/ggerganov/whisper.cpp In my tests it works fine on entry-level ARM hardware, although it might be a little too slow for serious use; that would need some more testing.
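
For reference, it's easy to drive the whisper.cpp example binary from Python (for example from a custom STT plugin) by shelling out to it. The flags below are taken from the whisper.cpp README and may change, so treat this as a sketch; note that it expects 16 kHz mono WAV input:

```python
# Sketch: call the whisper.cpp example binary and capture the recognized text.
# Binary path and model filename are assumptions; adjust to your build.
import subprocess


def transcribe_cpp(wav_path,
                   binary="./main",
                   model="models/ggml-tiny.en.bin"):
    # -m selects the ggml model, -f the (16 kHz mono) WAV file,
    # -nt drops timestamps so stdout is just the recognized text.
    result = subprocess.run(
        [binary, "-m", model, "-f", wav_path, "-nt"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(transcribe_cpp("utterance.wav"))
```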