wenet-e2e/wekws

WeKws Roadmap 2.0

robin1001 opened this issue · 5 comments

WeKws is a community-driven project and we love your feedback and proposals on where we should be heading.
Feel free to volunteer yourself if you are interested in trying out some items (they do not have to be on the list).

The following items are in 2.0:

  • Robustness: improve robustness by learning the acoustic features of the keywords rather than other cues.
  • Support for various on-device chips.
  • Exploration of unsupervised or pretrained models.

Hi, Robin. I found that modelscope has open-sourced a model that borrows a lot of code from wekws. They use CTC as the loss function and it seems to work well. I think there are at least two directions worth trying:

  • Pretrain in the manner of ASR, and train the KWS model with limited data.
  • Develop a customizable wake-up word system.

Thank you for open-sourcing wekws anyway! It is wonderful!

duj12 commented


Hi, I implemented this in PR #135; of course I borrowed a lot of code from modelscope, too. Hope this will be a good solution. There are still some things to be done, though, especially the runtime code.
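For anyone skimming, here is a minimal, hypothetical sketch of the CTC-based KWS idea in PyTorch. This is not the actual PR #135 code, and the class and parameter names are made up: the model emits per-frame log-probabilities over a token inventory, and torch.nn.CTCLoss aligns them to the keyword's token sequence.

```python
import torch.nn as nn

class CtcKws(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, num_tokens=100):
        super().__init__()
        # num_tokens is the size of the modeling-unit inventory
        # (e.g. phones/chars); index 0 is reserved for the CTC blank.
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, num_tokens)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        out, _ = self.encoder(feats)
        return self.proj(out).log_softmax(dim=-1)  # (B, T, num_tokens)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def kws_ctc_loss(model, feats, feat_lens, tokens, token_lens):
    # nn.CTCLoss expects (T, B, C) log-probs plus the keyword's token ids.
    log_probs = model(feats).transpose(0, 1)
    return ctc(log_probs, tokens, feat_lens, token_lens)
```

At inference time, detection can then score the keyword's token sequence over a sliding window instead of relying on a fixed binary class, which is what makes the wake word customizable.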

Hi, Robin. My personal feeling is that the KWS models we have are good, but the way we train them and the datasets we use are not so great.
KWS models are usually simple classifiers, and often the dataset is just two labels, KW and !KW. During training the accuracy curve is often a sign of overfitting, because the space between the KW class and the catch-all class of everything that is not a KW is huge.
Often adding a 'noise' class of non-spoken audio increases accuracy, as extra classes give the model something more to train against.
Noise and !KW are non-KW catch-alls, but they are still distant from the chosen KW.
This is where we are likely short on datasets: phonetically similar words should form further classes that hug the KW more tightly, give the model harder data to train on, flatten the learning curve, and counter the overfitting.
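As a toy illustration of that idea (the class names here are invented, not from wekws), the catch-all could be split into tighter classes, with phonetically similar "confuser" words sitting closest to the keyword:

```python
import torch.nn as nn

# Invented label scheme for illustration: split the !KW space into
# progressively tighter classes around the keyword.
CLASSES = [
    "keyword",    # the wake word itself
    "confuser",   # phonetically similar words that hug the keyword
    "speech",     # generic non-keyword speech (!KW)
    "noise",      # non-spoken audio
]

class KwsHead(nn.Module):
    """Tiny classifier head over pooled encoder features."""
    def __init__(self, feat_dim=128, num_classes=len(CLASSES)):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, pooled):   # pooled: (B, feat_dim)
        return self.fc(pooled)   # logits over the classes above
```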

The simple models we have are not the problem; what we are missing is datasets that have been linguistically analysed to create phonetic classes around a KW.
I have been trying this using the Multilingual Spoken Words corpus, which is extracted from the Mozilla Common Voice project.
Sadly, Common Voice contains huge swathes of bad data, from wrong labels, bad recordings, and poor alignment to non-native speakers, and the metadata is extremely sparse.

If you had the datasets, and also datasets recorded on the device of use, then you could make an accurate, simple, and lite KWS, so it's a bit of a catch-22. But if you picked a model and a device and gave users an option to opt in, then the dataset could be collected as Big Data did, with accompanying quality metadata covering simple gender, age, and region.

You can also collect locally: with on-device training you can bias a larger pretrained model with a smaller on-device model trained on a locally collected dataset.
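A rough sketch of how that biasing could look, assuming a pretrained wekws-style encoder exposed as `backbone` (the function and names are illustrative, not an existing wekws API): freeze the backbone and fine-tune only a small head on the locally collected data.

```python
import torch

def bias_on_device(backbone, head, local_loader, epochs=3, lr=1e-3):
    # Freeze the big pretrained model; only the small head learns locally.
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in local_loader:
            with torch.no_grad():
                emb = backbone(feats)           # (B, T, D) pretrained features
            loss = loss_fn(head(emb.mean(dim=1)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```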

Hi @robin1001 and @duj12, any plans for the ctc-kws runtime?

I just noticed Mining Effective Negative Training Samples for Keyword Spotting (github, paper)

I have been wondering about a dataset creator and how to select !KW samples without class imbalance.
I have only had a brief look through the code, but is there a dataset creator, and also one that implements the above?
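For what it's worth, the generic shape of hard-negative mining (this is not that paper's exact algorithm, just the common idea, with made-up names) is to score a pool of negatives with the current model and keep the ones it finds most keyword-like:

```python
import torch

KEYWORD_IDX = 0  # assumed index of the keyword class in the model output

def mine_hard_negatives(model, negative_pool, keep_ratio=0.1):
    """Return indices of the negatives the model finds most keyword-like."""
    model.eval()
    scored = []
    with torch.no_grad():
        for i, feats in enumerate(negative_pool):   # feats: (1, T, D)
            logits = model(feats)                   # (1, num_classes)
            scored.append((logits.softmax(dim=-1)[0, KEYWORD_IDX].item(), i))
    scored.sort(reverse=True)                       # most confusable first
    return [i for _, i in scored[: int(len(scored) * keep_ratio)]]
```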

Just to add: in my own experiments using 'own voice' data, I can quickly make a KWS that is very accurate.
For !KW I used phonetic pangrams, which are sentences that contain all forty sounds of English, i.e. they use all the phonemes, or phones, of English (rather than alphabetic characters). Here are a few examples: "That quick beige fox jumped in the air over each thin dog. Look out, I shout, for he's foiled you again, creating chaos."
Being English, I assume there are similar phonetic pangrams in other languages?

https://github.com/StuartIanNaylor/Dataset-builder was just a rough hack at a word-capture CLI to quickly capture 'own voice' KW and !KW samples, since forced alignment is so prone to error (plus it is my voice). These are augmented with speex, with random noise added, to give 2k-4k items in each class.
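The augmentation step is roughly this (a generic noise-mixing sketch in NumPy; the speex-specific processing is omitted and the function names are mine):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` (1-D float arrays) at a given SNR in dB."""
    noise = np.resize(noise, speech.shape)          # loop or trim the noise
    ps = np.mean(speech ** 2)
    pn = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(ps / (pn * 10 ** (snr_db / 10)))
    return speech + scale * noise

def augment(clip, noises, copies=4, snr_range=(5, 20)):
    """Make several noisy copies of one capture (toward 2k-4k per class)."""
    rng = np.random.default_rng()
    out = []
    for _ in range(copies):
        noise = noises[rng.integers(len(noises))]
        out.append(mix_at_snr(clip, noise, rng.uniform(*snr_range)))
    return out
```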

It's really easy to make a very accurate 'own voice' KWS, but it is totally useless for anyone else.
This is where I think transfer learning could be really important, as an initial model could be shipped, and via on-device training a smaller model of captures made during use could be created to bias the larger model.
Via transfer learning, over time and through use, the KWS would gain accuracy for those who commonly use it.

The small collection of a few phonetic pangrams surprised me with how accurate the results were. I have always had a hunch that, since phones have distinct spectra, larger datasets need the phones and their positions in the timeframe to be balanced, or at least balanced in uniqueness.