
Small-footprint keyword spotting

Group members // Daniele Ninni, Nicola Zomer

In this work, we experiment with several neural network architectures as possible approaches to the keyword spotting (KWS) task. We run our tests on the Google Speech Commands dataset, one of the most popular datasets in the KWS context. We define a CNN model that outperforms our baseline model and use it to study the impact of different preprocessing, regularization and feature extraction techniques. We find, for instance, that log Mel-filterbank energy features lead to the best performance, and that adding background noise to the training set with a reduction coefficient of 0.5 helps the model learn. We then explore different machine learning models, such as ResNets, RNNs, attention-based RNNs and Conformers, in order to achieve an optimal trade-off between accuracy and footprint. We find that these architectures offer a 30-40% improvement in accuracy over the baseline while using up to 10x fewer parameters.
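For concreteness, the sketch below shows the two preprocessing ingredients just mentioned, assuming librosa for feature extraction and 16 kHz mono clips; the constant names and window sizes are illustrative, not necessarily those used in the notebooks.

```python
# Minimal sketch: background-noise augmentation and log Mel-filterbank
# energies. Assumes clips are NumPy arrays at 16 kHz; NOISE_COEFF and the
# STFT parameters are illustrative choices, not the repository's exact code.
import numpy as np
import librosa

SAMPLE_RATE = 16000
NOISE_COEFF = 0.5  # reduction coefficient applied to the background noise


def add_background_noise(clean: np.ndarray, noise: np.ndarray,
                         coeff: float = NOISE_COEFF) -> np.ndarray:
    """Mix a randomly cropped background-noise clip into a clean utterance.

    Assumes the noise clip is at least as long as the clean utterance.
    """
    start = np.random.randint(0, len(noise) - len(clean) + 1)
    return clean + coeff * noise[start:start + len(clean)]


def log_mel_features(signal: np.ndarray, n_mels: int = 40) -> np.ndarray:
    """Log Mel-filterbank energies, shape (n_mels, n_frames)."""
    mel = librosa.feature.melspectrogram(
        y=signal, sr=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel)
```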

Notebooks

  • Data analysis and preprocessing inspection
This notebook loads the dataset and prepares it, splitting it into training, validation and test sets. It also provides some information about the dataset through plots, and it introduces the functions used to preprocess the data (for example, adding background noise, as in the sketch above).

  • Keyword spotting: general training notebook
This notebook defines the general training and testing pipeline and describes the validation metrics used. The training is performed using our baseline model cnn-one-fpool3, taken from [Arik17]; a rough sketch of this baseline appears after this list.

  • Bayesian optimization and feature comparison with CNN
This notebook is used to train our custom CNN models. On the first of these models we perform a Bayesian optimization (sketched after this list), which we use to inspect the importance of dropout and batch normalization, perform a feature comparison, and study the effect of data augmentation on the training set.

  • Keyword spotting: ResNet architecture and triplet loss implementation
In this notebook we experiment with ResNet models for the keyword spotting task. We start by implementing a simple ResNet architecture inspired by [Tang18]; then, motivated by [Vygon21], we modify this model and train it to produce a meaningful embedded representation of the input signals. Finally, we use k-NN to perform the classification task on these intermediate representations (see the triplet-loss sketch after this list).

  • Keyword spotting: a neural attention model for speech command recognition
This notebook implements an attention model for speech command recognition. It is obtained as a modification of a demo notebook prepared by the authors of the paper A neural attention model for speech command recognition; the core attention mechanism is sketched after this list.

  • Keyword spotting: Conformer
In this notebook, using the library audio_classification_models, we implement a baseline Conformer architecture inspired by [Gulati20]. This model combines Convolutional Neural Networks and Transformers, getting the best of both worlds by modeling both local and global features of an audio sequence in a parameter-efficient way. In detail, we use only one Conformer block in order to reduce the number of model parameters. Moreover, we perform hyperparameter tuning by means of Bayesian optimization to find, among the models with fewer than 2M parameters, the one that achieves the best accuracy.

  • Keyword spotting: GAN-based classification
    In this notebook we try to implement a GAN-based classifier inspired by the paper GAN-based Data Generation for Speech Emotion Recognition. Unfortunately, to date we have not been able to figure out how to properly train the generator and discriminator in this specific case. As a result, we cannot currently test this approach.
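As referenced in the general training notebook above, the baseline cnn-one-fpool3 is a single-convolution network whose output is max-pooled by 3 along the frequency axis. The Keras sketch below follows the published description of that architecture approximately; the input size and layer widths are illustrative, not necessarily the notebook's exact configuration.

```python
# Rough sketch of a cnn-one-fpool3-style baseline: one convolutional layer,
# pooling of 3 in frequency, a linear low-rank layer, and a small DNN head.
import tensorflow as tf

def cnn_one_fpool3(input_shape=(32, 40, 1), n_classes=35):
    inputs = tf.keras.Input(shape=input_shape)               # (time, freq, 1)
    x = tf.keras.layers.Conv2D(54, kernel_size=(32, 8),
                               activation="relu")(inputs)    # single conv layer
    x = tf.keras.layers.MaxPooling2D(pool_size=(1, 3))(x)    # pool 3 in frequency
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(32)(x)                         # linear low-rank layer
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```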
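The Bayesian optimization mentioned in the CNN notebook can be reproduced in spirit with keras-tuner, as in the sketch below. The search space (dropout rate, batch normalization on/off, filter count) mirrors the quantities discussed above but is otherwise illustrative; train_ds and val_ds are placeholder dataset names.

```python
# Minimal keras-tuner sketch of Bayesian hyperparameter optimization for a
# small CNN classifier; the search space is illustrative.
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(98, 40, 1)))
    model.add(tf.keras.layers.Conv2D(hp.Int("filters", 16, 64, step=16),
                                     (3, 3), activation="relu"))
    if hp.Boolean("batch_norm"):                 # inspect batch normalization
        model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dropout(hp.Float("dropout", 0.0, 0.5, step=0.1)))
    model.add(tf.keras.layers.Dense(35, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.BayesianOptimization(build_model, objective="val_accuracy",
                                max_trials=20, directory="bo",
                                project_name="kws")
# tuner.search(train_ds, validation_data=val_ds, epochs=10)
```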
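The triplet-loss idea from the ResNet notebook can be sketched as follows: an embedding network is trained with a triplet loss, and a k-NN classifier then operates on the learned embeddings. The loss here comes from TensorFlow Addons, and the backbone is a small placeholder network, not the notebook's actual ResNet.

```python
# Sketch: train an L2-normalized embedding with a triplet loss, then
# classify with k-NN on the embeddings. Backbone and sizes are placeholders.
import tensorflow as tf
import tensorflow_addons as tfa
from sklearn.neighbors import KNeighborsClassifier

embedder = tf.keras.Sequential([
    tf.keras.Input(shape=(98, 40, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(64),                               # embedding size
    tf.keras.layers.Lambda(lambda z: tf.math.l2_normalize(z, axis=1)),
])
embedder.compile(optimizer="adam", loss=tfa.losses.TripletSemiHardLoss())
# embedder.fit(x_train, y_train, epochs=20)  # labels are the keyword classes

# k-NN classification on the learned intermediate representations:
# knn = KNeighborsClassifier(n_neighbors=5).fit(embedder.predict(x_train), y_train)
# y_pred = knn.predict(embedder.predict(x_test))
```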
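Finally, the attention mechanism used in the speech-command attention notebook follows the general pattern sketched below: a query vector derived from the recurrent outputs attends over all time steps, and the attention-weighted sum feeds the classifier. Dimensions and the choice of query are simplified relative to the original model.

```python
# Simplified dot-product attention over bidirectional LSTM outputs.
import tensorflow as tf

inputs = tf.keras.Input(shape=(98, 40))                       # (time, features)
seq = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(inputs)  # (time, 128)
query = tf.keras.layers.Dense(128)(seq[:, -1, :])             # query from last step
scores = tf.keras.layers.Dot(axes=(1, 2))([query, seq])       # one score per step
weights = tf.keras.layers.Softmax()(scores)                   # attention weights
context = tf.keras.layers.Dot(axes=(1, 1))([weights, seq])    # weighted sum over time
outputs = tf.keras.layers.Dense(35, activation="softmax")(context)
model = tf.keras.Model(inputs, outputs)
```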

Utilities

Demo App

This repository includes a demo application that can be run as a Python script with python demo_ks.py. It lets you select the model you want to use and, once started, detects the words of the Speech Commands dataset through the microphone (or any chosen input device).

You can also find a notebook that plays commands from the dataset, which is useful for testing the application with non-real-time signals.
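For illustration, a detection loop in the spirit of the demo could look like the sketch below, assuming the sounddevice package, librosa, and a saved Keras model at the hypothetical path model.h5; demo_ks.py may differ in its details.

```python
# Sketch of a microphone detection loop: record one second, extract log-Mel
# features, classify. Model path and feature parameters are illustrative.
import numpy as np
import sounddevice as sd
import librosa
import tensorflow as tf

SAMPLE_RATE = 16000
model = tf.keras.models.load_model("model.h5")   # hypothetical path

while True:
    # record one second of audio from the default input device
    clip = sd.rec(SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1,
                  dtype="float32")
    sd.wait()
    mel = librosa.feature.melspectrogram(y=clip.squeeze(), sr=SAMPLE_RATE,
                                         n_fft=400, hop_length=160, n_mels=40)
    features = librosa.power_to_db(mel)[np.newaxis, ..., np.newaxis]
    probs = model.predict(features, verbose=0)[0]
    print("detected class:", probs.argmax(), "p =", probs.max())
```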


Human Data Analytics
University of Padua, A.Y. 2022/23