rhasspy/rhasspy3

Two small issues in the VAD->ASR pipeline processing


kha84 commented

Hello again. I've been playing with Rhasspy3 over the last several days and found a few small issues.

To begin with, here's my pipeline. It's pretty standard, and all the program settings are the defaults from your tutorial. The only small difference is that I swapped whisper for vosk to try out some additional ASR models:

pipelines:
  default:
    mic:
      name: arecord
    wake:
      name: porcupine1
    vad:
      name: silero
    asr:
      name: vosk.client
    handle:
      name: repeat
    tts:
      name: piper.client
    snd:
      name: aplay

Issue number one - silero-vad sensitivity? Other VADs?

When the wake step is activated (by me saying the wake word into the microphone), it sometimes happens that the VAD (silero) fails to detect my speech, so the pipeline keeps hanging on that step until the VAD times out. It could be because my microphone is noisy. Or I was just speaking too quietly. Or the default sensitivity of silero-vad is just a bit too low for my setup.

I guess I'm not the only one who'll be playing with Rhasspy3 using a microphone of questionable quality, with a constant background hum in its recordings :) So I tried to find a way to configure silero-vad to make it more sensitive, but it looks like no such settings are exposed in configuration.yaml right now. So this one is more of a gentle, low-priority feature request. In my particular case, I think I should just invest a few bucks in a more decent mic than the one I have now, probably an electret one.
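To make the request a bit more concrete, here is a minimal sketch of what a configurable sensitivity would change. It assumes the VAD yields one speech probability per audio chunk and compares it against a threshold. The function name, the `threshold` and `min_chunks` parameters, the 0.5 default, and the sample scores are all made up for illustration; this is not Rhasspy3's actual code.

```python
from typing import Iterable


def speech_detected(
    probabilities: Iterable[float],
    threshold: float = 0.5,  # hypothetical default, not taken from Rhasspy3
    min_chunks: int = 3,     # hypothetical: consecutive chunks required
) -> bool:
    """Return True once `min_chunks` consecutive chunks score >= `threshold`."""
    consecutive = 0
    for prob in probabilities:
        if prob >= threshold:
            consecutive += 1
            if consecutive >= min_chunks:
                return True
        else:
            # A chunk below the threshold resets the run.
            consecutive = 0
    return False


# Made-up scores for quiet speech over a humming microphone.
quiet_speech = [0.10, 0.35, 0.42, 0.45, 0.40, 0.20]

print(speech_detected(quiet_speech))                 # default threshold
print(speech_detected(quiet_speech, threshold=0.3))  # more sensitive
```

With scores like these, the default threshold never fires, while a lower one does. Exposing something like `threshold` in configuration.yaml would let people with noisy microphones tune it themselves.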

Then I tried to swap silero-vad for something different. Rhasspy3 ships with two other VADs, energy and webrtcvad, and I saw that both have some kind of sensitivity setting in the configuration. I followed steps similar to those in your tutorial to install them, but as soon as I plug either of them into the pipeline, it starts spilling error messages at me. It looks like they're not quite ready, right? Or was I just doing something wrong? I'm happy to dig into that myself if you tell me that both energy and webrtcvad work out of the box.

Issue number two - even if VAD wasn't triggered, the captured PCM is still sent down the pipeline to ASR

Consider this scenario:

  1. saying the wake word loudly
  2. seeing the pipeline reach the VAD step
  3. quietly whispering something, without triggering VAD
  4. seeing that VAD is not activated
  5. after the VAD timeout (something like 10-15 seconds, I guess), whatever audio was captured by the microphone is still sent to ASR, and the recognized text is then sent to the handler.

The issue, as I see it, is in point 5. If a VAD is present in the pipeline configuration (it is mandatory right now, I guess) but wasn't triggered for whatever reason, then after the timeout the pipeline shouldn't send the captured audio down to ASR. Otherwise, what's the point of having a VAD here? :) If someone still needs the PCM to be sent to ASR even when VAD wasn't triggered, I guess that could be made configurable in configuration.yaml.
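The behavior I have in mind could look roughly like the sketch below. It is only an illustration under my own assumptions: `transcribe_if_speech`, the `asr_on_vad_timeout` flag, and the chunk-count timeout are hypothetical names, and `is_speech` / `transcribe` stand in for the real VAD and ASR programs. End-of-speech detection is left out to keep it short.

```python
from typing import Callable, Iterable, List, Optional


def transcribe_if_speech(
    chunks: Iterable[bytes],
    is_speech: Callable[[bytes], bool],   # stand-in for the VAD
    transcribe: Callable[[bytes], str],   # stand-in for the ASR
    timeout_chunks: int,
    asr_on_vad_timeout: bool = False,     # hypothetical config option
) -> Optional[str]:
    """Send audio to ASR only if the VAD triggered before the timeout."""
    buffered: List[bytes] = []
    speech_started = False

    for count, chunk in enumerate(chunks, start=1):
        buffered.append(chunk)
        if is_speech(chunk):
            speech_started = True
        elif not speech_started and count >= timeout_chunks:
            break  # VAD timed out without ever hearing speech

    if not speech_started and not asr_on_vad_timeout:
        # Drop the audio: nothing reaches ASR or the handler.
        return None

    return transcribe(b"".join(buffered))


# Toy stand-ins: a chunk equal to b"S" counts as speech.
fake_vad = lambda chunk: chunk == b"S"
fake_asr = lambda audio: f"{len(audio)} bytes transcribed"

print(transcribe_if_speech([b"q"] * 5, fake_vad, fake_asr, timeout_chunks=3))
```

With the flag left at its default, a whisper that never triggers the VAD produces no transcript at all. Setting `asr_on_vad_timeout=True` would keep today's behavior for anyone who relies on it.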

That's it so far. Again - great piece of software!

kha84 commented

Btw, I have created a Dockerfile and a simple bash script that detects what kind of audio setup your host has (Pulse / Pipewire), builds a proper Docker image with Rhasspy3 preconfigured, and spins up a container from it. I'm still polishing it, which will take me a few days, but once I'm done I'll share it for sure.