rhasspy/rhasspy3

Wake_word_command (with no pauses or confirmations)

kha84 opened this issue · 2 comments

kha84 commented

Hello there.

Have you seen what the OpenVoiceOS folks are doing with their Mycroft fork? https://youtu.be/2D1IZaj2Uws

From the video description, it looks like they made the wake word acknowledgment sound (the beep) play in parallel with the start of recording the command from the mic. It's a hacky approach, but as you can see in the video, it does the job pretty well.

I was thinking that if I ever had to implement this myself, I would probably do it a bit differently. First, I would record audio from the microphone all the time into some kind of ring buffer. When the wake word is detected, I would note exactly when it fired and stream the PCM from that moment in the buffer to the ASR module until a pause is detected. Streaming (with speech recognition running in parallel as you talk) should drastically reduce the delay caused by the sequential nature of the current architecture, where speech is first recorded and only then the resulting WAV file is fed to the ASR.

As a result:

  1. You don't need to insert a long pause yourself, between saying the wake word and saying the actual command, just to wait for the BEEP.
  2. Recognition starts earlier, shortly after you begin speaking, so by the time you stop, the ASR module only has a tiny bit of the PCM stream left to process. This gives more headroom for ASR and lets people use larger, heavier ASR models with a real-time factor close to 1, without waiting extra seconds for their hardware to recognize what they just said starting from the first byte of PCM, as happens now.
  3. All of this should greatly improve the user experience.
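The pipeline described above could be sketched roughly like this. Everything here is illustrative, not Rhasspy's actual code: the chunk values and the equality checks stand in for a real wake word detector and VAD, and `asr_input` stands in for streaming into an ASR engine.

```python
from collections import deque

RING_CHUNKS = 4  # keep only the last few pre-wake chunks (small for the demo)

def run_pipeline(chunks, wake_chunk, silence_chunk):
    """Buffer mic chunks in a ring; once the wake word fires, flush the
    buffered audio and then stream live chunks until silence is detected."""
    ring = deque(maxlen=RING_CHUNKS)  # old chunks are dropped automatically
    asr_input = []                    # chunks the ASR would receive, in order
    listening = False

    for chunk in chunks:
        if not listening:
            ring.append(chunk)
            if chunk == wake_chunk:    # stand-in for wake word detection
                listening = True
                asr_input.extend(ring) # flush pre-buffered audio to ASR
                ring.clear()
        else:
            asr_input.append(chunk)    # stream live audio; ASR decodes as you talk
            if chunk == silence_chunk: # stand-in for VAD / pause detection
                break

    return asr_input
```

Because the ASR receives the buffered audio immediately and then each live chunk as it arrives, decoding overlaps with speaking instead of starting only after the recording ends.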

Nevertheless, amazing work! thanks a lot for all your contributions to the community.

synesthesiam commented

@kha84 This is exactly how Rhasspy 3 functions, actually 🙂

Audio captured during wake word detection is kept in a deque with a maximum length (a ring buffer).
When the wake word is detected, this buffer is fed into the speech-to-text system, and then audio chunks are streamed simultaneously to speech-to-text and silence detection.

There is a timestamp associated with each audio chunk, and I use it in the mic record sample script to rewind the ring buffer and find the audio chunk closest to an event (in this case, VAD). I may add this to the default audio pipeline as well, since it would be more accurate than dumping the entire ring buffer into ASR.
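The rewind step can be sketched as a simple search over the buffered chunks' timestamps. This is only an illustration of the idea, not the actual sample-script code:

```python
def closest_chunk_index(timestamps, event_time):
    """Return the index of the buffered chunk whose timestamp is closest
    to event_time (e.g. the moment the wake word or VAD fired), so the
    buffer can be replayed to ASR starting from that chunk."""
    return min(range(len(timestamps)),
               key=lambda i: abs(timestamps[i] - event_time))
```

Starting ASR from that index, rather than from the oldest chunk in the buffer, avoids feeding it seconds of unrelated pre-wake audio.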

kha84 commented

@synesthesiam Awesome! You're miles ahead of me. I'll go play with Rhasspy 3 right away.