Where to find datasets for wake word detection training?
Hello. I think it would be nice to include some links or hints in the README about this.
The problem is that I have no idea where to find those. I was recommended to use these samples: https://github.com/Picovoice/wake-word-benchmark/tree/master/audio, but you still need to collect recordings without the wake word yourself.
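For reference, one way to grab those samples is just to clone the benchmark repo and take its `audio` folder; this is only a sketch, the folder layout is whatever the repository currently ships:

```bash
# Sketch: fetch the Picovoice wake-word-benchmark samples linked above.
git clone --depth 1 https://github.com/Picovoice/wake-word-benchmark.git
ls wake-word-benchmark/audio  # the sample recordings live in this folder
```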
What I did for the wake word I built was to collect a few minutes of recordings from a podcast through the microphone, by running a for loop in a bash script around `rustpotter-cli records --ms 2000 $i.wav`. There were still many false positives in my live tests, so I collected a bunch of them using `rustpotter-cli spot -t 0.8 --record-path ./noises trained.rpw` (the 'noises' folder needs to exist). I also added some more positive detections recorded the same way to balance the numbers; at that point the medium and large model sizes started to be confident.
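In case it helps anyone reproducing this, here is roughly what that collection script looks like. It is only a sketch built from the exact commands quoted above; the loop structure and file names are my own, so double-check the subcommands and flags against `rustpotter-cli --help` for your version.

```bash
#!/usr/bin/env bash
# Capture 2-second clips from the microphone while a podcast plays,
# to use later as non-wake-word samples (commands as quoted above).
mkdir -p ./samples ./noises
for i in $(seq -w 1 100); do
  rustpotter-cli records --ms 2000 "./samples/$i.wav"
done

# Afterwards, mine false positives during live testing: detections above
# the threshold are written into ./noises and can be fed back as negatives.
rustpotter-cli spot -t 0.8 --record-path ./noises trained.rpw
```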
I also noticed that audio quality matters. I collected all the initial recordings using my MacBook microphone, which captures very clean sound; when I switched to using the model with my Jabra speaker on a Raspberry Pi, I had to take some recordings there and add them to the dataset to achieve similar performance, because that setup captures the audio with some minor echo and background noise. That is another thing that stopped me from trying to collect and share a dataset: in the end it seems like a task that requires a group of people with several devices to be involved.
Currently I'm using a medium size model with threshold 0.93, min counter 15, and the gain normalizer filter, and I'm having a pretty good experience: the detection works most of the time even when I'm watching TV.
In case you are interested in my setup: I'm using it with OpenHAB and a whisper.cpp add-on I'm working on (for voice generation I'm still using a cloud service), and it gives me an acceptable experience. My server is running on an Orange Pi 5; I will get a Raspberry Pi 5 in a couple of weeks, which I think can be overclocked to 3.0 GHz, so I hope it works a little faster there. I'm using a small fine-tuned Whisper model for Spanish that I found on HF. As the speaker I'm using a Jabra Speaker2 40 connected to a Raspberry Pi Zero 2 W (previously I was using an older Jabra speaker, but the sound was not as good, as mentioned above).
(Demo video attached: video_2023-11-03_12-55-24-2.mp4)
The setup is summarized here: https://community.openhab.org/t/dialog-processing-with-the-pulseaudiobinding/148191, though I don't think anyone else has tried it and succeeded yet.
There is Mozilla Common Voice. And the word cuts are provided by MSWC (the Multilingual Spoken Words Corpus). I could open a PR.