Detect wake words for ESPHome's voice assistant component on the device

ESPHome 2024.2 officially includes the micro_wake_word component. The latest training framework is available in the microWakeWord repository. All future updates will happen in microWakeWord and ESPHome directly, so this repository is now archived.

On Device Wake Word Detection for ESPHome's Voice Assistant Component

This component implements wake word detection on the ESPHome device itself. It currently implements "Hey Jarvis" as the wake word, but any custom word/phrase is possible after training a new model. The micro_wake_word component starts the assist pipeline immediately after detecting the wake word without using Wyoming-openWakeWord.

It works well, with detection performance comparable to openWakeWord. Detection latency is very low, nearly always lower than that of an ESP32 device using the openWakeWord pipeline.

Wake word detection is done entirely with TensorFlow Lite Micro. Wake word models are trained without using Espressif's proprietary Skainet and can be customized without samples from different speakers. Sample preparation, generation, and augmentation heavily use code from openWakeWord.

The target devices are ESP32-S3 boards with external PSRAM. The component may run on a regular ESP32, but it may not perform as well.

It is currently not trivial to train a new model. I am developing a custom training framework to make the process much easier!

YAML Configuration

See the example YAML files for various S3 box models.

Benchmarks

Benchmarking and comparing wake word models is challenging. It is hard to account for all the different operating environments. Picovoice has provided one benchmark for at least one point of comparison.

The following graph depicts the false-accept/false-reject rate for the "Hey Jarvis" model compared to the equivalent openWakeWord model.

[Figure: FPR/FRR curve for "hey jarvis" pre-trained model]

Graph Credit: dscripka

For a more rigorous false acceptance metric, we tested the "Hey Jarvis" model on the Dinner Party Corpus dataset. The component's default configuration values result in 0.187 false accepts per hour.
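To put the reported rate in more intuitive terms, the short sketch below converts it into an average time between false accepts and a per-day count. The function name is hypothetical, and the conversion assumes the rate stays constant over time, which real environments will not guarantee: the Dinner Party Corpus is one specific kind of audio.

```python
def false_accept_stats(rate_per_hour: float) -> dict:
    """Convert a false-accept rate (per hour) into easier-to-read figures."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return {
        # Average gap between false accepts, assuming a constant rate.
        "hours_between_false_accepts": 1.0 / rate_per_hour,
        # Expected count over 24 hours of comparable audio.
        "false_accepts_per_day": rate_per_hour * 24.0,
    }


stats = false_accept_stats(0.187)
print(f"{stats['hours_between_false_accepts']:.2f} hours between false accepts")
print(f"{stats['false_accepts_per_day']:.2f} false accepts per day")
```

With the default configuration's 0.187 false accepts per hour, this works out to roughly one false accept every 5.35 hours of audio similar to the test corpus.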

Detection Process

The component detects the wake word in two stages. First, raw audio data is processed into 40 features every 20 ms; consecutive slices of these features form a spectrogram. Second, a streaming inference model takes only the newest slice of feature data as input to detect the wake word. If the model consistently predicts the wake word over multiple windows, the component starts the assist pipeline.

The first stage processes the raw mono (single-channel) audio data at a sample rate of 16 kHz via the micro_speech preprocessor. The preprocessor generates 40 features from 30 ms (the window duration) of audio data. It generates these features every 20 ms (the stride duration), so the first 10 ms of each window overlaps with the previous window. This process is similar to calculating a Mel spectrogram for the audio data, but it is lightweight enough for devices with limited processing power. See the linked TFLite Micro example for full details on how the audio is processed.
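The window and stride arithmetic above can be sketched in a few lines of Python. This only illustrates how the audio is split into overlapping windows; it does not reproduce the micro_speech feature computation, and the constant and function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # mono audio sample rate
WINDOW_MS = 30           # audio covered by one feature slice
STRIDE_MS = 20           # how far the window advances each step

WINDOW_SAMPLES = SAMPLE_RATE_HZ * WINDOW_MS // 1000  # 480 samples
STRIDE_SAMPLES = SAMPLE_RATE_HZ * STRIDE_MS // 1000  # 320 samples
OVERLAP_SAMPLES = WINDOW_SAMPLES - STRIDE_SAMPLES    # 160 samples (10 ms)


def frame_audio(samples):
    """Yield (start_index, window) for every complete 30 ms window."""
    start = 0
    while start + WINDOW_SAMPLES <= len(samples):
        yield start, samples[start:start + WINDOW_SAMPLES]
        start += STRIDE_SAMPLES


one_second = [0] * SAMPLE_RATE_HZ  # one second of silence as stand-in audio
windows = list(frame_audio(one_second))
print(len(windows))  # 49 complete windows in one second
```

Each window would then be reduced to 40 features by the preprocessor, so one second of audio yields roughly 49 feature slices.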

The streaming model performs an inference every 20 ms on the newest audio stride. The model is based on an inception neural network converted for streaming. It executes an inference in under 10 ms on an ESP32-S3, much faster than the 20 ms stride length. Training the model and converting it for streaming use modified open-source code from Google Research, described in the paper Streaming Keyword Spotting on Mobile Devices by Rybakov, Kononenko, Subrahmanya, Visontai, and Laurenzo.
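The "consistent prediction over multiple windows" check can be illustrated with a sliding average of the per-stride probabilities. This is a minimal sketch, not the component's actual code: the class name, the use of a mean, and the window size and cutoff values are all assumptions chosen for illustration.

```python
from collections import deque


class WakeWordDetector:
    """Hypothetical smoothing of per-stride wake word probabilities."""

    def __init__(self, window_size=10, probability_cutoff=0.5):
        self.window_size = window_size
        self.probability_cutoff = probability_cutoff
        self.recent = deque(maxlen=window_size)  # newest probabilities only

    def update(self, probability):
        """Add the newest probability; return True on a detection."""
        self.recent.append(probability)
        if len(self.recent) < self.window_size:
            return False  # not enough history yet
        if sum(self.recent) / self.window_size > self.probability_cutoff:
            self.recent.clear()  # avoid retriggering on the same utterance
            return True
        return False


detector = WakeWordDetector(window_size=5, probability_cutoff=0.8)
# One isolated spike (index 1), then a sustained run of high probabilities.
stream = [0.1, 0.95, 0.1, 0.1, 0.1, 0.9, 0.95, 0.9, 0.97, 0.99]
detections = [i for i, p in enumerate(stream) if detector.update(p)]
print(detections)  # [9]
```

The isolated spike never triggers a detection because the average stays low; only the sustained run at the end pushes the average above the cutoff. Requiring agreement across several strides trades a small amount of latency for fewer false accepts.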

Next Steps and Improvement Plans

  • Make the model training process more straightforward.
  • Generate and provide more pre-trained models.
  • Make it easy to switch between models in the YAML config.

Model Training Process

We generate positive and negative samples using openWakeWord, which relies on the Piper sample generator. We also use openWakeWord's data tools to augment the positive and negative samples. In addition, we add other sources of negative data, such as music or prerecorded background noise. Then, we train the two models using code from Google Research. The streaming model is an inception neural network converted for streaming.

Acknowledgements

I am very thankful for many people's support to help improve this! Thank you, in particular, to the following individuals and groups for providing feedback, collaboration, and developmental support: