CART, SVM and ANN implementations for embedded usage.

Embedded Machine Learning

Introduction

This project implements and tests machine learning algorithms adapted for an embedded environment. The following classifiers were used:

  • Classification and Regression Trees (CART)
  • Support Vector Machines (SVM)
  • Artificial Neural Networks (ANN)

The task at hand is the identification of musical styles, tackled using the GTZAN Genre Collection dataset, in which each audio file is stored in the AU file format. A reader has been implemented for this format, as well as feature extraction based on a discrete short-time Fourier transform (STFT).

A full description of the project can be found in the original repository, together with the base helper routines provided.

Setup

The project is organized as follows:

.
├── ANN
│   ├── prediction_ann.cpp
│   └── train_ann.py
├── CART
│   ├── CART.py
│   ├── Crop recommandations
│   ├── Crop recommandations.pdf
│   ├── prediction_cart.cpp
│   └── train_cart.py
├── DATA
│   ├── ann_stats.txt
│   ├── ann_weights.txt
│   ├── blues
│   ├── classical
│   ├── country
│   ├── disco
│   ├── features.csv
│   ├── features_prof.csv
│   ├── features_testing.csv
│   ├── features_training.csv
│   ├── file_list_test.txt
│   ├── file_list_train.txt
│   ├── hiphop
│   ├── jazz
│   ├── metal
│   ├── pop
│   ├── reggae
│   ├── rock
│   ├── svm_coeff.csv
│   ├── svm_feat_stats.csv
│   └── tests.org
├── eval.sh
├── Evaluation
│   ├── evaluate_ann.cpp
│   ├── evaluate_cart.cpp
│   └── evaluate_svm.cpp
├── Extraction
│   ├── features_extraction.cpp
│   ├── features_extraction.h
│   └── main.cpp
├── Helpers
│   ├── au_reading.h
│   ├── etypes.h
│   ├── file_helpers.h
│   ├── globals.h
│   ├── music_style_helpers.cpp
│   ├── music_style_helpers.h
│   ├── print_helpers.h
│   ├── signal.h
│   └── wav_reading.h
├── readme.org
├── RF
├── setup.sh
└── SVM
    ├── prediction_svm.cpp
    └── train_svm.py

18 directories, 38 files

Each classifier has its own folder (ANN/, CART/, RF/, SVM/) containing a training script, usually written in Python since training does not have to run on the embedded target, and a classifier in C++.

DATA/
Contains the audio files from the dataset. They are not included in the repository because of their size, but the tree above shows the organization required to reproduce the results.
Extraction/
Contains the feature extraction code, implemented using only the STFT.
Helpers/
Contains assorted utilities, e.g. for signal processing, global definitions, and label management.
Evaluation/
Contains the programs that measure the accuracy, execution time and memory usage of the algorithms. They are invoked individually for each classifier.

setup.sh and eval.sh are tangled from the code snippets presented in this file: everything up to model training goes into the former, and the performance evaluation into the latter.

Requirements

All of the libraries used should already be included in a standard build toolchain installation. The only addition is TensorFlow, used for training the neural network; installation instructions can be found in its official documentation.

Usage

The first step is to compile the source code, which can be done with the following block; it works both on a personal computer and on a Raspberry Pi board.

# build feature extraction
mkdir -p build/setup
cmake -S Extraction -B build/setup  # "Extraction" may be replaced by the desired module
make -C build/setup                 # or . for the entire project

Only the required module is compiled at this point, because some files needed to build the other modules are generated in the following steps. An example is CART, whose prediction tree only exists after the training has been done in Python.

Feature extraction

The extraction of features from the audio files must happen before the training steps. In our case only the dataset is used, split into training and testing data. This process takes around 70 on the tested machines.

It writes the files file_list_train.txt and file_list_test.txt to the DATA/ folder, containing the file paths used for training and testing. The testing list is the default input for the evaluation binaries if no other is provided.

The features themselves are extracted to the files features_training.csv and features_testing.csv. They correspond to the aforementioned splits and are kept precomputed to avoid repeated extraction runs.

Having the project built, the extraction can be executed with the following command:

# extract features
./build/setup/EXTRACTION
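
For illustration only, the sketch below shows what STFT-based feature extraction can look like. This is not the code in Extraction/features_extraction.cpp; the window size, hop size and the use of averaged magnitude spectra as features are assumptions made for the example.

// illustrative sketch: average magnitude spectrum over STFT frames (naive DFT)
#include <cmath>
#include <complex>
#include <vector>

std::vector<double> stft_features(const std::vector<double>& samples,
                                  std::size_t window = 512, std::size_t hop = 256) {
    const double pi = std::acos(-1.0);
    std::vector<double> features(window / 2, 0.0);               // one value per frequency bin
    std::size_t frames = 0;
    for (std::size_t start = 0; start + window <= samples.size(); start += hop, ++frames) {
        for (std::size_t k = 0; k < window / 2; ++k) {           // frequency bins
            std::complex<double> bin(0.0, 0.0);
            for (std::size_t t = 0; t < window; ++t) {           // naive DFT of one frame
                double angle = -2.0 * pi * k * t / window;
                bin += samples[start + t] * std::complex<double>(std::cos(angle), std::sin(angle));
            }
            features[k] += std::abs(bin);
        }
    }
    if (frames > 0)
        for (double& f : features) f /= frames;                  // average over all frames
    return features;
}

A real implementation would use an FFT and a window function, but the overall structure of framing, transforming and aggregating is the same.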

CART

In order to use the CART algorithm, you must first build the classification tree:

# train cart tree
python3 CART/train_cart.py

This will generate the file CART/prediction_cart.cpp, with a function consisting of a sequence of if/else statements analogous to the paths of the trained binary tree.
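
The shape of the generated function can be illustrated by the sketch below; the feature indices, thresholds and depth are made up, only the structure mirrors the real output.

// illustrative sketch of the generated tree (indices and thresholds are made up)
#include <string>
#include <vector>

std::string predict_cart(const std::vector<double>& features) {
    if (features[12] <= 0.135) {
        if (features[3] <= 42.7)
            return "classical";
        else
            return "jazz";
    } else {
        if (features[57] <= 1.92)
            return "metal";
        else
            return "rock";
    }
}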

SVM

To use the SVM model, we must likewise first execute the related Python script:

# train svm model
python3 SVM/train_svm.py

This will generate the files DATA/svm_coeff.csv, holding the weights and biases of the model's hyperplanes, and DATA/svm_feat_stats.csv, holding the statistics of the features used in training, so that the same preprocessing can be replicated when predicting on new data.
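
As an illustration of how such statistics are typically reused at prediction time (the exact contents and layout of DATA/svm_feat_stats.csv are an assumption here), each incoming feature vector would be rescaled with the stored per-dimension mean and standard deviation:

// illustrative sketch: apply stored per-feature mean/std before SVM prediction
#include <cstddef>
#include <vector>

void standardize(std::vector<double>& features,
                 const std::vector<double>& mean,
                 const std::vector<double>& stddev) {
    for (std::size_t i = 0; i < features.size(); ++i)
        features[i] = (features[i] - mean[i]) / stddev[i];  // same scaling as in training
}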

ANN

The usage of the ANN method requires first the training of the associated neural network, also specified in a Python script:

# train ann
python3 ANN/train_ann.py

This will generate the file DATA/ann_weights.txt, with the weights and biases of all layers of the trained model.

Evaluation

In order to run the evaluation of the learning methods, it is necessary first to compile the Evaluation module, as in the following block:

# build evaluation code
mkdir -p build/eval
cmake -S Evaluation -B build/eval -DUSE_TESTS_FILE=OFF -DVERBOSE=OFF
make -C build/eval

The features used for testing have already been computed by the Extraction module, and the necessary learned parameters by the respective training scripts. As noted before, everything is stored in the DATA/ folder.

If desired, the compilation variable USE_TESTS_FILE can be used to ignore the precomputed features and extract them instead from the list of music file paths in DATA/file_list_test.txt. This list is also produced during extraction, matching the separation of training and testing datasets, so if nothing is modified both options provide the same results; evaluation simply takes longer with the flag on.

The VERBOSE compilation variable was also added to increase the amount of information printed to stdout during evaluation. Most notably, it specifies what is being processed in the various steps of the methods, and during evaluation it prints a labeled confusion matrix instead of plain numbers (implemented in Helpers/print_helpers.h), as shown below.

Confusion matrix:
               blues classical   country     disco    hiphop      jazz     metal       pop    reggae      rock
     blues        17         0         4         2         0         1         2         0         3         1
 classical         0        28         0         1         0         1         0         0         0         0
   country         3         1        13         3         0         1         2         0         0         7
     disco         0         0         3        12         5         0         1         2         5         2
    hiphop         2         0         2         3        17         0         3         1         2         0
      jazz         3         2         0         0         0        24         0         0         1         0
     metal         1         0         0         2         1         0        24         0         0         2
       pop         1         0         4         0         3         0         0        21         1         0
    reggae         2         0         2         2         4         0         3         3        14         0
      rock         2         1         5         1         0         2         5         2         1        11

To use those options:

cmake -S Evaluation -B build/eval -DUSE_TESTS_FILE=ON -DVERBOSE=ON

One executable is generated per method, and no extra arguments are required, since everything is read from the expected files and compilation options. Execution is exemplified below:

# CART evaluation
./build/eval/EVALUATION_CART

# SVM evaluation
./build/eval/EVALUATION_SVM

# ANN evaluation
./build/eval/EVALUATION_ANN

The prediction code of each algorithm is stored in its respective folder and is used by the statistical performance analysis defined within the Evaluation module.

Implementation

CART

The generation of the CART builds on the implementation made available in the original repository; it only had to be adapted to the already split data, and some testing was done with varying maximum depths. Some problems were encountered with node expansion failing due to invalid thresholds, but we could not find the source of the problem. The training sometimes fails with a TypeError, but repeated executions eventually succeed.

The main addition was the C++ code generation from the learned tree, where a recursive depth-first exploration emits conditionals that replicate the tree paths in code. The generated file is CART/prediction_cart.cpp, and it contains a function that returns a literal string with the class for a given feature vector.

Time complexity

During training, with d the number of dimensions of a feature and n the number of data elements used, each element requires all of its dimensions to be analyzed for information gain, and every insertion has to walk a path in the tree, whose length is logarithmic in the number of elements. Building a CART from n data points is therefore close to $O(n \cdot d \cdot \log(n))$.

During prediction, nothing but the traversal of a binary tree is performed, which corresponds to $O(\log(d))$ complexity. Strictly speaking, the complexity should be stated in terms of the size of the tree (the depth of a balanced binary tree of n nodes); it is assumed here that the number of nodes is linearly related to the number of dimensions.

Space complexity

The memory used for creating the CART corresponds to a single copy of the nodes of the tree built so far, being linear with respect to the number of nodes.

SVM

The SVM algorithm was first tested in Python using the =LinearSVC= class from scikit-learn, with accuracy values that did not match the ones reported by the professor for his feature set. With that in mind, the pipeline was changed to use the =SVC= class with a linear kernel, consistently providing accuracy values over 10% higher (around 64%, better shown in #sec.results).

The interpretation of the coefficients for predicting classes from new data in C++ had to change as well, since they represent the hyperplanes of the one-vs-one duels between all pairs of classes, 45 in total. This is better explained in the multi-class classification section of the scikit-learn documentation. In practice, a 45x512 matrix is traversed row-wise, computing the inner product with the feature vector to obtain a value that separates the two compared classes: positive values represent a “win” for the first class of the pair, negative values a win for the second.
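
A minimal sketch of this one-vs-one voting is shown below. The container names and the enumeration order of the class pairs are assumptions; only the row-wise inner product and the win counting follow the description above.

// illustrative sketch: one-vs-one SVM voting (pair enumeration order is an assumption)
#include <algorithm>
#include <numeric>
#include <vector>

int predict_svm(const std::vector<std::vector<double>>& coeffs,  // r rows of d coefficients
                const std::vector<double>& bias,                 // r biases
                const std::vector<double>& x,                    // feature vector of size d
                int num_classes) {
    std::vector<int> votes(num_classes, 0);
    std::size_t row = 0;
    for (int a = 0; a < num_classes; ++a) {
        for (int b = a + 1; b < num_classes; ++b, ++row) {
            double decision = std::inner_product(coeffs[row].begin(), coeffs[row].end(),
                                                 x.begin(), bias[row]);
            ++votes[decision > 0 ? a : b];   // positive: first class of the pair wins
        }
    }
    return static_cast<int>(std::distance(votes.begin(),
                            std::max_element(votes.begin(), votes.end())));
}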

The training script is located at SVM/train_svm.py and the prediction at SVM/prediction_svm.cpp.

Time complexity

For training, taking d as the number of dimensions of a feature and n as the number of data elements, solving the underlying quadratic programming problem gives a complexity of $O(n \cdot d^2)$. This is the core of the solver, and the other terms disappear in the worst case.

For prediction, now considering c as the number of classes and $r = c(c-1)/2$ as the number of duels performed, the result comes from processing $Y$ in $Y = A X + B$, where $A$ is a matrix of r rows and d columns holding the learned SVM coefficients and $B$ is a vector of size r storing the biases. Being a single-threaded implementation, $O(n \cdot r \cdot d)$ approximates the execution of its two nested loops over n data points, ignoring the processing of $Y$ itself, which is linear in the size of $Y$ (i.e. the number of rows of $A$).

Space complexity

In terms of memory usage, only the set of features needs to be loaded, meaning the space occupied is proportional to $d \cdot n$.

ANN

The training part of the implementation was done in Python using the TensorFlow library, as mentioned in #sec.requirements. A 3-layer neural network was built, using 2 dense layers with ReLU activation and 1 output layer with softmax activation, so that the outputs can be interpreted as a probability distribution summing to 1.
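
For reference, the softmax applied to the output vector $z$ of the last layer is $\text{softmax}(z)_j = e^{z_j} / \sum_k e^{z_k}$, which guarantees non-negative outputs that sum to 1.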

The RMSprop optimizer was used for training, scaling the step sizes according to the magnitude of the gradients during back-propagation. A standard sparse categorical cross-entropy loss function was used, matching the type of labels we have. The model was trained with a 60/40 train/validation split.

For predicting classes in C++, the feed-forward step was implemented for a network of arbitrary architecture, limited however to ReLU and softmax activation functions. A custom text file is generated by the training script to save the weights and biases, starting with the number of layers and, for each layer, its dimensions. Each layer is represented by an M×N matrix, where each of the M rows is a neuron composed of N weights matching its input.

Standard library algorithms were used to make the code easier to follow, namely transform, transform_reduce (added in C++17) and inner_product, some of which are also present in the SVM implementation.
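
A minimal sketch of such a feed-forward pass is shown below, assuming the weights and biases have already been parsed from DATA/ann_weights.txt into in-memory containers; the container layout and function names are assumptions, not the project's actual code.

// illustrative sketch: dense feed-forward with ReLU hidden layers and softmax output
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // M rows of N weights, as in the weights file

std::vector<double> dense(const Matrix& weights, const std::vector<double>& bias,
                          const std::vector<double>& input) {
    std::vector<double> out(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i)             // one inner product per neuron
        out[i] = std::transform_reduce(weights[i].begin(), weights[i].end(),
                                       input.begin(), bias[i]);
    return out;
}

std::vector<double> feed_forward(const std::vector<Matrix>& weights,
                                 const std::vector<std::vector<double>>& biases,
                                 std::vector<double> x) {
    for (std::size_t l = 0; l < weights.size(); ++l) {
        x = dense(weights[l], biases[l], x);
        if (l + 1 < weights.size())                              // hidden layers: ReLU
            std::transform(x.begin(), x.end(), x.begin(),
                           [](double v) { return std::max(0.0, v); });
    }
    double m = *std::max_element(x.begin(), x.end());            // output layer: softmax
    double sum = 0.0;
    for (double& v : x) { v = std::exp(v - m); sum += v; }
    for (double& v : x) v /= sum;
    return x;                                                    // class probabilities
}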

The training script is located at ANN/train_ann.py and the prediction in ANN/prediction_ann.cpp.

Time complexity

For the training complexity, we'll consider n as the number of dimensions of a feature, m as the number of data elements, l as the number of layers, and $n_i$ as the number of neurons in layer i. The analysis covers the two main steps, feed-forward and back-propagation; training corresponds to the worse of the two, while prediction involves only the first.

When performing the feed-forward pass to generate a class prediction, each feature vector of size n leads to l matrix multiplications, always between matrices of shape $n_{i-2} \times n_{i-1}$ and $n_{i-1} \times n_i$, with $i \in \{0, \dots, l\}$. This means operations of $O(n^3)$ time complexity are performed l times, plus a bias addition that can be ignored in the notation. Between layers, either a softmax or a ReLU activation is applied, but those are linear in their input and disappear in the worst case, leading to $O(l \cdot n^3)$ complexity.

The back-propagation complexity can be obtained with similar reasoning, also arriving at $O(l \cdot n^3)$. As the two steps happen in sequence during training, their sum has the same order of magnitude, and so does the overall time complexity.

Space complexity

During execution, only the layers currently being processed need to remain in memory, together with the input feature vector, leading to a memory consumption relative to the size of the layers and the number of connections between them. Dense layers use the most space, as they have the maximum number of connections between neurons and previous outputs, but still in the order of $n_{i-2} \times n_{i-1}$ and $n_{i-1} \times n_i$.

Results

Executing each of the evaluation programs, we obtain the following results:

sh ./eval.sh
CART accuracy: 0.47
Confusion matrix:
               blues classical   country     disco    hiphop      jazz     metal       pop    reggae      rock
     blues         8         1         6         2         2         3         0         0         2         6
 classical         0        27         1         0         0         1         0         1         0         0
   country         7         0         8         0         0         3         1         1         4         6
     disco         1         0         4        10         3         1         3         2         3         3
    hiphop         0         0         1         5        14         1         3         0         1         5
      jazz         1         2         3         0         0        21         1         1         0         1
     metal         0         0         2         2         1         0        15         0         0        10
       pop         0         0         0         1         4         1         0        21         3         0
    reggae         3         0         1         4         3         0         0         2        14         3
      rock         6         0         7         3         0         2         5         3         1         3

SVM accuracy: 0.683333
Confusion matrix:
               blues classical   country     disco    hiphop      jazz     metal       pop    reggae      rock
     blues        23         1         3         0         0         0         0         0         2         1
 classical         0        29         0         0         0         0         0         0         0         1
   country         4         1        15         2         0         0         0         2         2         4
     disco         2         1         0        17         4         0         1         2         3         0
    hiphop         0         0         0         3        20         0         0         1         5         1
      jazz         4         2         0         0         1        21         0         0         2         0
     metal         0         0         2         0         1         1        25         0         0         1
       pop         0         0         1         2         6         1         0        20         0         0
    reggae         2         1         1         2         4         0         0         0        20         0
      rock         0         1         1         6         0         2         2         2         1        15


ANN accuracy: 0.64
Confusion matrix:
               blues classical   country     disco    hiphop      jazz     metal       pop    reggae      rock
     blues        20         0         2         1         0         1         0         0         3         3
 classical         1        26         1         0         0         0         0         0         1         1
   country         3         0        17         1         0         1         1         1         0         6
     disco         1         0         1        16         2         2         0         3         2         3
    hiphop         0         0         0         4        18         0         2         0         6         0
      jazz         3         2         2         0         0        21         0         0         2         0
     metal         0         0         0         1         2         1        20         0         0         6
       pop         0         1         0         2         1         0         0        25         1         0
    reggae         2         1         2         2         3         1         0         1        17         1
      rock         3         0         2         7         0         1         2         1         2        12

We can see that CART clearly has the worst results, while SVM and ANN reach similar accuracy. The ANN could probably be explored further with different architectures, given more time, to achieve better results.

The programs run correctly on the target machine, as shown in the video available here: https://youtu.be/kAHLz-Ts-24