Wake confirmation tone sometimes picked up as speech
Closed this issue · 12 comments
I tried flashing with the latest version as of Saturday, and enabled this option. It seems to work well, but about 1/4 of the time it immediately stops listening after the tone and displays that it heard "Ding!"
Bad photo below as an example. There was no background noise at the time, and it was pretty clear each time it happened that it was the tone it picked up and not some nearby speech.
This was actually pretty funny, but makes using this option a problem for reliable use of Willow. Maybe a configurable delay is needed between playing the tone and actually listening for speech? Is there a setting like this already? I haven't experimented with this yet, but maybe increasing the VAD delay would help as well? I don't know if it would still pick it up as a "ding" followed by the real speech.
One thing you can probably look at is the record buffer under Advanced Settings. For a variety of reasons and configuration "clashes" (for lack of a better term) this setting has fairly wide ranging effects on all kinds of things and could be related to what's happening here. I'm also wondering if there could be other environmental or situational issues at play. For example - where is "Hello! How can I assist you" coming from?
As for a delay: we do routine testing with 1,000 loops of audio playback, from wake word to TTS response, across combinations of configuration settings. At the risk of sounding like "can't reproduce, closing", I've never seen this.
The VAD delay is how soon to end VAD after it detects the end of speech. It's basically a knob to balance "How fast can you spit out a command vs responsiveness" as higher values give you more opportunity to pause and finish speaking during speech but will obviously wait longer after you have stopped, leading to an increase in overall command execution time.
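To make that trade-off concrete, here is a minimal sketch of a trailing-silence timeout. It is not Willow's actual VAD code; the function name, the fixed frame length, and the per-frame voiced/unvoiced flags are all assumptions made for illustration.

```python
from typing import List, Optional


def end_of_capture_ms(frames_voiced: List[bool], frame_ms: int,
                      vad_timeout_ms: int) -> Optional[int]:
    """Return the time (ms) at which capture would stop, or None if it never stops.

    Hypothetical model: once speech has been heard, capture stops as soon as
    the silence following the most recent voiced frame reaches vad_timeout_ms.
    """
    silence_ms = 0
    heard_speech = False
    for i, voiced in enumerate(frames_voiced):
        if voiced:
            heard_speech = True
            silence_ms = 0          # any speech resets the countdown
        elif heard_speech:
            silence_ms += frame_ms  # accumulate trailing silence
            if silence_ms >= vad_timeout_ms:
                return (i + 1) * frame_ms
    return None


# 100 ms frames: 300 ms of speech, a 400 ms pause, 300 ms of speech, then silence.
frames = [True] * 3 + [False] * 4 + [True] * 3 + [False] * 12

short = end_of_capture_ms(frames, frame_ms=100, vad_timeout_ms=300)
long = end_of_capture_ms(frames, frame_ms=100, vad_timeout_ms=1000)
print(short, long)
```

In this made-up input, the short timeout cuts the capture off during the mid-sentence pause, while the long one keeps the whole command but waits a full second after the speaker has finished.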
I also see this, though not as frequently. I have my willow set up to use a REST endpoint and the ESP both chimes on wake and reads back the REST response. Sometimes (maybe 1 in 10 or 20) the chime is transcribed as "Ding." :)
For example - where is "Hello! How can I assist you" coming from?
I've got a custom voice assistant running in HA at the moment. I can't see how that would affect the speech that it hears though - that all occurs afterwards.
My knowledge of acoustics is pretty weak, so this may be nonsense, but:
Could this be related somehow to the environment around the device reflecting the chime back with a short delay? There's a hard cabinet wall right next to it, and a metal fridge on the adjacent wall a few feet away that I'd guess could be causing this?
Ok, just wanted to check. We've found over the months that a lot of people do really interesting things so it's important to ask a lot of questions when you see something out of the norm.
The ESP BOX and the framework we use are "supposed to" cancel out audio feedback/echo playing back through the speaker so that it doesn't loop back into the mic (even with reflection and audio delay). There are some configuration options for this, relating to what is basically speakerphone functionality, that the cancellation may be limited to, and we could check these.
Do you both have Acoustic Echo Cancellation enabled under Advanced Settings? It's on by default unless you changed it. Otherwise, we can play around and see if we can reproduce this, but like I said it's been tested over a couple of thousand cycles (at least) and that didn't happen. We'd know because we look at the success rate for the HA response and consider anything less than a 99% success rate unacceptable. Failures can almost always be counted on one hand, and something like 1/20 would lead to an abysmal result.
I'll test out a couple different values for the record buffer and double check on the Acoustic Echo Cancellation setting.
I think it's totally fair to call this a weird edge case though - it sounds like you're testing this pretty thoroughly and would catch it if it were a common issue. Very possible it's somehow related to either a setting I didn't realize I changed, or my particular physical installation somehow.
I do have the echo cancellation on. I also have settings that turn up the speaker volume and allow a longer timeout and speech duration than the defaults, IIRR: speaker volume at 95%; Microphone Gain 14; Record Buffer 12; Maximum speech duration 5; VAD Timeout 1000.
I'm seeing this on a pretty old ESP Box; I'll try and reproduce on the latest production Box3 later.
Really appreciate the testing/debugging everyone!
@hamishcunningham 95% is pretty loud; that isn't something we've tested this scenario with and I'm kicking myself for not having considered that. I can try a test run with the volume cranked and see if I experience this.
Hi @kristiankielhofner, yes, I figured it might be the volume...
Anyhow I just tried flashing 0.2.1 to my new Box-3 (one of the final ones, with the different coloured bases) and it seems that chime and sound output don't work with the radar/IR sensing base; probably a GPIO conflict I guess -- shall I file an issue for that?
The non-sensing base works fine, and I haven't seen the "Ding!" transcription yet... Which reminds me to ask if your test harness is available? I need to be more systematic about large-scale testing :)
Tnx!
@hamishcunningham I don't know if this is swamp gas or what but see my quick demo video:
https://www.youtube.com/watch?v=ZGFxTDx9A_s
This is with volume set to 100%.
We have seen a weird issue where the BOX-3 can get in some kind of strange state... Usually pulling it out of the base and reseating permanently solves it until you start switching bases around again. We haven't dug into what's happening in the first place but it's not 100% reproducible either.
This is with 0.2.1. As I'm sure you've noticed we haven't updated the client card images in the web ui yet because Espressif is likely going to make a BOX-3 variant with a completely white base and enclosure.
The public test harness is extremely simple. utils.sh has a torture argument that plays audio in loops (1000 by default). You can change the audio file to match your wake word, entity, etc. as well as the delay between loops by setting vars in .env. Run that on something with a speaker near your device(s) with the console connected to the playback device.
Then you capture the monitor/log output in another window and can do simple things like grep 'WAKE_END' | wc -l to see any misses for whatever event you're looking for in the logs. With WAS Command Endpoint Mode, for example, you can grep speech | grep true | wc -l and it will count all the way from wake to the action being completed on the command endpoint. Our internal test harness does some additional things like stream logs to a parser/DB/etc. but the fundamentals are just from utils.sh torture.
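For anyone who wants a success rate rather than raw counts, the same grep-and-count idea can be sketched in Python. The log lines below are invented purely for illustration and do not reflect Willow's real log format; only the WAKE_END, speech and true search strings come from the commands above, and count_matching is a hypothetical helper.

```python
from typing import Iterable


def count_matching(lines: Iterable[str], *needles: str) -> int:
    """Count lines containing every needle, like `grep a | grep b | wc -l`."""
    return sum(1 for line in lines if all(n in line for n in needles))


# Made-up sample of captured monitor output (the format is an assumption).
sample_log = [
    "I (1000) WILLOW: WAKE_END",
    'I (1200) WILLOW: {"speech": "turn on the light", "ok": true}',
    "I (5000) WILLOW: WAKE_END",
    'I (5200) WILLOW: {"speech": "Ding!", "ok": false}',
]

wakes = count_matching(sample_log, "WAKE_END")
successes = count_matching(sample_log, "speech", "true")
rate = successes / wakes if wakes else 0.0
print(f"{successes}/{wakes} successful ({rate:.0%})")
```

Comparing the computed rate against a threshold such as the 99% mentioned earlier in the thread would then be a one-line check.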
Ok great, thanks. I reseated the sensor base and it did indeed work perfectly, sorry for the false alarm!
I did some more testing with a couple of combinations of settings, and I believe my experience is most likely due to waiting too long for the wake confirmation tone and allowing the VAD timeout to elapse.
Without the tone, I've been using it as "Hi ESP ", but with the tone enabled, I generally say something like "Hi ESP ". From tweaking this setting a bit, I realized that every time the "ding" was captured, I had probably waited just over the 300 ms that was configured as the default timeout. Then the only sound it has to go on is the tone, so it attempts to use that as the transcription.
I noticed from the video posted in the thread above that there's no pause between the wake word and the command. That totally makes sense to use it this way since it can handle it, but the wake tone just trained me into waiting for it.
tl;dr I don't think there's a bug here, I just talk too slowly for the settings I was using 😆
A possible usability improvement here might be to artificially extend the delay until after the tone has completed playing, but that might be finicky to get right.
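The timing I'm describing, and the suggested tweak, can be modelled very roughly as below. This is a simplified sketch, not how Willow actually implements it: command_is_captured is a hypothetical function, the 200 ms tone length is a guess, and the only figures taken from this thread are the 300 ms default timeout and the 1000 ms value mentioned earlier.

```python
def command_is_captured(pause_before_command_ms: int, vad_timeout_ms: int,
                        tone_ms: int = 0, wait_for_tone: bool = False) -> bool:
    """Return True if the user starts speaking before the capture times out.

    Simplified model: the timeout countdown starts at wake. With
    wait_for_tone=True the countdown is instead held until the confirmation
    tone has finished, which extends the deadline by tone_ms.
    """
    deadline_ms = vad_timeout_ms + (tone_ms if wait_for_tone else 0)
    return pause_before_command_ms < deadline_ms


# Waiting just over the 300 ms default: only the tone is left to transcribe.
print(command_is_captured(350, vad_timeout_ms=300))

# Same pause, but the countdown is held until a (hypothetical) 200 ms tone ends.
print(command_is_captured(350, vad_timeout_ms=300, tone_ms=200, wait_for_tone=True))

# Same pause with a longer configured timeout.
print(command_is_captured(350, vad_timeout_ms=1000))
```

The finicky part would be knowing exactly when the tone has finished playing on the device, which this model simply assumes.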