pyxyyy/gesture-drumkit

Create an ML model to detect gesture onset

Closed this issue · 13 comments

Data Preparation

  • Collect data (up/down only, 50 each, keep it simple first)
  • Visualisation (Jupyter notebook)
  • Normalise and smooth using a moving average window
  • Slice and label
  • *Refactor to process multiple CSVs

Training

  • Preprocessing
  • Build a single-layer LSTM network (any better architectures?)
  • Export model (convert from Keras to TF)

Integration

  • Set up TensorFlow Lite on Android
  • Import model and add prediction code
  • Tie up other loose ends

Further Improvement (Don't think there's enough time though:/)

  • Collect more data
  • Blah blah blah

Remember to save the std & avg, so we can apply the same preprocessing during inference.

I'm not sure if using an LSTM makes sense though.
An LSTM is usually used to predict the next few tokens, given a sequence of tokens.
(E.g. given a sequence of words, predict the next N words that should come after it.)
A classification model might be more appropriate (given a sequence of tokens, predict a class).

Remember to save the std & avg, so we can apply the same preprocessing during inference.

Sure, will document the steps. But do we need to save the std & avg for the training data? I was thinking of something like per-example mean subtraction (i.e. calculated per window).

(Not exactly sure what preprocessing steps are needed. I remember last time we did something like applying simple normalisation (mapping to the [-1, 1] range), followed by a low-pass filter to smooth out the signal. Kindly add more steps if you think it helps!)
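
For concreteness, a minimal sketch of the per-window version I had in mind (NumPy only; the function name and the smoothing width are placeholders, not from the notebook):

```python
import numpy as np

def preprocess_window(window, smooth_n=5):
    """window: (n_frames, n_channels) sensor readings for one slice."""
    # per-window normalisation: map each channel into [-1, 1]
    lo, hi = window.min(axis=0), window.max(axis=0)
    scaled = 2.0 * (window - lo) / (hi - lo + 1e-8) - 1.0
    # crude low-pass filter: moving average along the time axis
    kernel = np.ones(smooth_n) / smooth_n
    return np.apply_along_axis(
        lambda ch: np.convolve(ch, kernel, mode="same"), 0, scaled)
```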

I'm not sure if using an LSTM makes sense though.
An LSTM is usually used to predict the next few tokens, given a sequence of tokens.
(E.g. given a sequence of words, predict the next N words that should come after it.)
A classification model might be more appropriate (given a sequence of tokens, predict a class).

I kind of had the same impression of LSTMs. I found some articles suggesting that an LSTM can also do classification:
https://www.analyticsvidhya.com/blog/2019/01/introduction-time-series-classification/
https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
Will that fit our use case?
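
For reference, a sequence classifier with a single LSTM layer looks roughly like this in Keras (the layer width and the (50, 4) window shape are assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# the LSTM consumes the whole window and returns only its final hidden
# state, which the softmax layer maps to a gesture class
model = Sequential([
    LSTM(64, input_shape=(50, 4)),   # 50 frames x 4 channels (assumed)
    Dense(2, activation="softmax"),  # up / down for now
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```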

mean & std

Usually during normalization (for images), we subtract the mean & divide by the std of the entire training set. The training data is assumed to be representative of the data obtained at test time.

Then during testing, we just apply the same process as above, but using the "representative" mean & std obtained from training.
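
As a sketch (assuming the training windows are stacked in a train_x array of shape (n_windows, n_frames, n_channels); file names are placeholders):

```python
import numpy as np

# statistics come from the training set only
train_mean = train_x.mean(axis=(0, 1))        # per-channel mean
train_std = train_x.std(axis=(0, 1)) + 1e-8   # per-channel std

def standardise(x):
    # the same "representative" statistics are reused for val/test/inference
    return (x - train_mean) / train_std

# persist them so the Android side can apply identical preprocessing
np.save("train_mean.npy", train_mean)
np.save("train_std.npy", train_std)
```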

LSTM

Hmm, you make a good point. Let's try it out. Looks like this guy tried something similar too.
Maybe we can skip Keras altogether. Not sure if the Keras-to-TF conversion will work well; the last time I tried it was a long time ago.

https://medium.com/@curiousily/human-activity-recognition-using-lstms-on-android-tensorflow-for-hackers-part-vi-492da5adef64

The tricky part will be segmenting/annotating the data.
I think we can try something like this:

  1. Look for peaks
  2. Based on direction, guess what the peak is (up or down)
    Maybe we can analyze one model peak that we know is up or down. Then, we compute the cosine similarity between every unknown peak and that model peak (rough sketch after this list).
  3. Grab frames around the peaks (15 frames before, peak frame, 10 frames after)
    Not sure about the number of frames. Also depends on our window size / looking at the data. I assume we need more frames before the peak than after.
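
A rough sketch of step 2, assuming we hand-pick one 'model' window per class (all names here are placeholders):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def guess_direction(peak_window, up_template, down_template):
    """Compare an unknown peak window against one known 'up' window and
    one known 'down' window, and pick whichever is more similar."""
    if cosine_similarity(peak_window, up_template) >= \
       cosine_similarity(peak_window, down_template):
        return "up"
    return "down"
```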

[image]

Current implementation (preprocess.ipynb); a rough sketch follows the list:

  • Look for peaks in the RMS of accelerometer readings. The peaks must be:
    • at least larger than the mean RMS AND
    • separated from each other by a minimum of 100 samples (see code comment)
  • Since RMS is always positive, no need to guess for the direction.
  • Set a window of 50 frames (250 ms). Grab 35 frames before and 15 frames after the peak; 35 and 15 are chosen to capture the complete rising and falling edges of the pulse.
  • Save each slice to a new file. Gestures of the same category are placed in the same folder.
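
Roughly, the slicing logic reads like the sketch below (scipy's find_peaks with a height and a distance constraint; the exact parameter values live in preprocess.ipynb):

```python
import numpy as np
from scipy.signal import find_peaks

def slice_gestures(accel, before=35, after=15, min_gap=100):
    """accel: (n_samples, 3) accelerometer readings for one recording.
    Returns 50-frame (before + after) windows centred on RMS peaks."""
    rms = np.sqrt((accel ** 2).mean(axis=1))
    # peaks must exceed the mean RMS and be >= min_gap samples apart
    peaks, _ = find_peaks(rms, height=rms.mean(), distance=min_gap)
    windows = []
    for p in peaks:
        if p - before >= 0 and p + after <= len(accel):
            windows.append(accel[p - before:p + after])
    return windows
```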

Implemented the model and gave it a try. The code seemed to work but apparently we need more data.

Train on 74 samples, validate on 9 samples
Epoch 1/10
74/74 [==============================] - 2s 28ms/step - loss: 0.1656 - acc: 0.9324 - val_loss: 1.5297 - val_acc: 0.7778

Epoch 00001: val_acc improved from -inf to 0.77778, saving model to best_model.pkl
Epoch 2/10
74/74 [==============================] - 1s 12ms/step - loss: 0.2283 - acc: 0.9324 - val_loss: 0.4513 - val_acc: 0.7778

Epoch 00002: val_acc did not improve from 0.77778
Epoch 3/10
74/74 [==============================] - 1s 12ms/step - loss: 0.2143 - acc: 0.9054 - val_loss: 0.4557 - val_acc: 0.8889

Epoch 00003: val_acc improved from 0.77778 to 0.88889, saving model to best_model.pkl
Epoch 4/10
74/74 [==============================] - 1s 16ms/step - loss: 0.0359 - acc: 0.9865 - val_loss: 0.8811 - val_acc: 0.7778

Epoch 00004: val_acc did not improve from 0.88889
Epoch 5/10
74/74 [==============================] - 1s 14ms/step - loss: 0.0378 - acc: 0.9865 - val_loss: 0.8776 - val_acc: 0.7778

Epoch 00005: val_acc did not improve from 0.88889
Epoch 6/10
74/74 [==============================] - 1s 13ms/step - loss: 0.0330 - acc: 0.9865 - val_loss: 0.8486 - val_acc: 0.8889

Epoch 00006: val_acc did not improve from 0.88889
Epoch 7/10
74/74 [==============================] - 1s 13ms/step - loss: 0.0320 - acc: 0.9865 - val_loss: 0.8415 - val_acc: 0.8889

Epoch 00007: val_acc did not improve from 0.88889
Epoch 8/10
74/74 [==============================] - 1s 14ms/step - loss: 0.0319 - acc: 0.9865 - val_loss: 0.8700 - val_acc: 0.8889

Epoch 00008: val_acc did not improve from 0.88889
Epoch 9/10
74/74 [==============================] - 1s 13ms/step - loss: 0.0315 - acc: 0.9865 - val_loss: 0.9194 - val_acc: 0.8889

Epoch 00009: val_acc did not improve from 0.88889
Epoch 10/10
74/74 [==============================] - 1s 15ms/step - loss: 0.0314 - acc: 0.9865 - val_loss: 0.8191 - val_acc: 0.8889

Epoch 00010: val_acc did not improve from 0.88889
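
(The "val_acc improved ... saving model" lines above look like Keras' ModelCheckpoint with save_best_only; for reference, something along these lines, with the model and data arrays assumed to exist:)

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# keep only the weights with the best validation accuracy so far
checkpoint = ModelCheckpoint("best_model.pkl", monitor="val_acc",
                             save_best_only=True, verbose=1)
model.fit(train_x, train_y, validation_data=(val_x, val_y),
          epochs=10, callbacks=[checkpoint])
```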

We can try adding negative examples as well to represent no gesture detected.
Just let the watch record while we don't move our hand much, or move our hand really slowly / in directions that are not up or down.

I'll try and record some additional data later too.

We might get better results if we explicitly pair the gyroscope & accelerometer data.
If I'm not wrong, the LSTM is currently fed all the accel readings first, then the gyro readings.
By the time the gyro arrives, the accel info was seen long ago, so the LSTM might not really 'remember' it.

Scrap that, let's go for an FC (fully connected) model (Keras sketch after the spec below).
Input: (50+50, 4)

Layers:
256 Dense + relu
Flatten
256 Dense + relu
3 Dense + softmax

Output shape: (3, 1)
E.g. [1, 0, 0], where idx 0 is the down gesture, idx 1 is up, and idx 2 is none.
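
A direct Keras reading of the spec above (optimizer and loss are my assumptions; the first Dense(256) sits before Flatten, so it is applied per frame, exactly in the order listed):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

model = Sequential([
    # 100 frames (50 accel + 50 gyro) x 4 features, as described above
    Dense(256, activation="relu", input_shape=(100, 4)),
    Flatten(),
    Dense(256, activation="relu"),
    Dense(3, activation="softmax"),  # idx 0 = down, 1 = up, 2 = none
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["acc"])
```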

We might get better results if we explicitly pair the gyroscope & accelerometer data.
If I'm not wrong, the LSTM is currently fed all the accel readings first, then the gyro readings.
By the time the gyro arrives, the accel info was seen long ago, so the LSTM might not really 'remember' it.

Actually, you make a good point. If we 'zip' the gyroscope & accelerometer data together, we are kind of forcing a strong correlation between the two streams. Maybe that can improve the accuracy.

We can pair adjacent sensor readings together s.t. instead of (50+50, 4) the input becomes (50, 4+4). I can make it a switchable flag in the preprocessing script.
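
A sketch of what that flag could do to each window (the flag and function names are placeholders):

```python
import numpy as np

def maybe_pair(window, pair_sensors=True):
    """window: (100, 4), first 50 rows accel, last 50 rows gyro.
    If pair_sensors, zip the two streams so frame i holds
    [accel_i | gyro_i], giving shape (50, 8)."""
    if not pair_sensors:
        return window                      # keep the (50+50, 4) layout
    accel, gyro = window[:50], window[50:]
    return np.concatenate([accel, gyro], axis=1)
```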

// edited

Since we can't use an LSTM, we can just make sure that the data fed into the model is at consistent positions. 50 accel first then 50 gyro, or paired, should be similar. We can assume that the model will learn the temporal relationship between the data points. As long as we consistently feed the data into the right positions, it should be fine.

Having each data point contain the sensor type sounds like a good idea though. I didn't think about that. The model can potentially learn from information that we got for free, especially when the data from both sensors can look really similar after normalization.

Yes you're right. Let's keep it in the same order (50 accel then 50 gyro).

Having each data point contain the sensor type sounds like a good idea though.
So the input becomes (50+50, 5)? i.e. [sensor_type, x, y, z, rms]

I think RMS can actually be ignored because

  • it's derived from x, y, z, so it provides no additional information.
  • gestures in all directions will have the same/similar RMS pulse, so it can't really help to tell the direction.
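
So each window would end up with a sensor_type channel in place of RMS. A sketch (the 0/1 encoding for the sensor type is an arbitrary choice):

```python
import numpy as np

def tag_sensor_type(accel, gyro):
    """accel, gyro: (50, 3) arrays of x, y, z (RMS dropped).
    Returns (100, 4) rows of [sensor_type, x, y, z],
    with sensor_type 0 for accel and 1 for gyro."""
    accel_tagged = np.hstack([np.zeros((len(accel), 1)), accel])
    gyro_tagged = np.hstack([np.ones((len(gyro), 1)), gyro])
    return np.vstack([accel_tagged, gyro_tagged])
```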

[image]

I'm thinking: should we take multiple slices around the peak?

For example, the peak is at index 1000, given a window size of 50,

  • originally we take [965, 1015], i.e. 35 frames before and 15 after
  • we can slide the window and take 10~20 extras, e.g. [955, 1005], [956, 1006], ..., [975, 1025], and label them all as gesture detected.

It's similar to what you suggested: 'take some range from the left of a gesture and label it as negative data'. Basically we can (or should) apply a sliding window to the entire recording and label all the windows. I guess it provides better coverage of similar cases and basically gives us more training data/information.
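
A sketch of the sliding-window slicing (the number of extra shifts and the step of 1 are things to tune, not decided yet):

```python
import numpy as np

def sliding_slices(data, peak, before=35, after=15, extra=10):
    """Take the original [peak-before, peak+after) slice plus `extra`
    shifted copies on each side, all given the same gesture label."""
    window = before + after
    slices = []
    for shift in range(-extra, extra + 1):
        start = peak - before + shift
        if 0 <= start and start + window <= len(data):
            slices.append(data[start:start + window])
    return slices
```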