This repository aims to provide a unified interface to datasets for the task of accelerometer-based Human Activity Recognition (HAR). The philosophy is to catalogue as many datasets as possible from a wide variety of recording conditions (in terms of data format, feature extraction, label space, sampling frequency, device location/orientation, etc.) for the purpose of understanding the efficacy of transfer learning, online learning, lifelong learning, data representation, and feature extraction across a large collection of datasets.
This project follows the Data Science Cookiecutter template with the aim of facilitating reproducible models and results. The majority of commands are executed with the `make` command, and we also provide a high-level data loading interface.
All data will be converted to a simple CSV format with the following columns:

```
time, subject_id, sequence_id, activity_labels, fold_id, x, y, z
```
where `time` is in seconds and of type double, `subject_id` is an integer identifier of the subject, `sequence_id` identifies contiguous activities (one subject may therefore perform a task several times), `x`, `y`, and `z` are the x-, y-, and z-axis data (whether acceleration, magnetometer, or gyroscope), and `activity_labels` are the labels of the dataset. Finally, `fold_id` is an identifier that specifies the fold in which the data should appear; negative values will only be used in training, consistent with scikit-learn's `PredefinedSplit` module.
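The `fold_id` semantics can be sketched directly with scikit-learn; the fold values below are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold_id column: rows with -1 are always kept in training,
# while non-negative values index the test fold a row belongs to.
fold_id = np.array([-1, -1, 0, 0, 1, 1])

split = PredefinedSplit(test_fold=fold_id)
for train_idx, test_idx in split.split():
    print(train_idx, test_idx)
```

Each iteration yields one train/test partition; the two `-1` rows appear in every training set.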
We have made the decision to keep our data format relatively simple since we hope it will provide a language-agnostic interface to the data, so that users of, for example, R, MATLAB, Python, or C++ can use the data once it has been built. For datasets with several views into movement (e.g. with volunteers wearing several devices, or with IMUs providing not only acceleration data but also gyroscope and magnetometer data), we have decided that each 'view' be contained in a separate file, since in some cases the data are sampled at different rates. However, the views may be merged together using the `subject_id`, `time`, and `sequence_id` fields of the format above.
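Merging two views on those shared fields can be sketched with pandas; the frames below stand in for two hypothetical view files (all values are illustrative):

```python
import pandas as pd

# One accelerometer view and one gyroscope view of the same recording,
# both following the CSV format described above.
acc = pd.DataFrame({
    "time": [0.00, 0.02], "subject_id": [1, 1], "sequence_id": [0, 0],
    "activity_labels": ["walk", "walk"], "fold_id": [0, 0],
    "x": [0.10, 0.20], "y": [0.00, 0.10], "z": [9.80, 9.70],
})
gyr = pd.DataFrame({
    "time": [0.00, 0.02], "subject_id": [1, 1], "sequence_id": [0, 0],
    "x": [0.01, 0.02], "y": [0.00, 0.01], "z": [0.03, 0.02],
})

# Align the two views on the shared identifier fields.
merged = acc.merge(gyr, on=["subject_id", "time", "sequence_id"],
                   suffixes=("_acc", "_gyr"))
```

The `suffixes` argument disambiguates the overlapping `x`, `y`, and `z` columns into `x_acc`/`x_gyr` and so on.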
The following table enumerates the datasets accounted for in this repository, sorted by the surname of the first author of the paper.
We will gladly accept contributions to this repository in any form, but particularly we welcome additional datasets, new feature extraction processes, view representations, and bug fixes.
New datasets can be added by submitting a new issue (preferred) or by contacting me via email. Simply provide the information required to populate a new row in the table above. If you have a transformer that converts the data to the preferred format, please attach this too; if not, I will attempt to write a converter for the data, but this may take some time.
Two steps must be performed for a Pull Request to be accepted: 1. update the table above; and 2. add the transformer to the repository. These steps are outlined in more detail below:
The table above can be updated by adding a row with the following information:
| AuthorName | DatasetName | [PaperName](PaperURL) | [Description](DescriptionURL) | [Download](DownloadURL) | PublicationYear | SamplingFrequency | HasAccelerometer | HasGyroscope | HasMagnetometer | NumSubjects | NumActivities | Notes |
Please insert the new row alphabetically based on the first author's surname, and then by publication date in the event of a tie. Note that the name of the dataset is immutable and will be changed only in exceptional circumstances.
A new data transformer should be placed in `src/converters/<DatasetName>.py`, where `<DatasetName>` matches the second element of the newly inserted row. This file must provide a function called `<DatasetName>` which accepts the path to the raw data as an argument and returns pandas DataFrames. Using the `spherechallenge` dataset as an example, the file `src/converters/spherechallenge.py` will contain the following:
```python
def spherechallenge(input_path):
    data = load_sphere_challenge_data(input_path)
    return data
```
It is important that there is consistency between the name of the dataset in the table above, the name of the file in the `src` directory, and the name of the function, since the module importer reads the dataset information from the table and dynamically loads the transformation functions. In other words, the function must be importable as follows:
```python
from spherechallenge import spherechallenge
```
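The dynamic loading described above can be sketched with `importlib`; the helper name `load_transformer` is ours for illustration and not part of the repository:

```python
import importlib

def load_transformer(dataset_name):
    # Import the module named after the dataset and return the
    # transformer function of the same name, per the convention above.
    module = importlib.import_module(dataset_name)
    return getattr(module, dataset_name)
```

Given the `spherechallenge` example, `load_transformer("spherechallenge")` would return the `spherechallenge` function.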
We have implemented several feature extraction processes in the `src/features` directory, together with interfaces to map these features to the above datasets. New features should be relatively straightforward to add since they will typically operate on a matrix of acceleration data and return a vector. As a simple example, one may extract the `mean`, `standard deviation`, `range`, `min`, and `max` values as follows:
```python
import numpy as np

stat_funcs = [np.mean, np.std, np.ptp, np.min, np.max]

def extract_stat_features(data):
    return np.concatenate([func(data, axis=0) for func in stat_funcs])
```
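In practice such a feature extractor would typically be applied per window of samples rather than to a whole recording. A minimal sketch (the window length and data are illustrative, and the definitions are repeated so the snippet is self-contained):

```python
import numpy as np

stat_funcs = [np.mean, np.std, np.ptp, np.min, np.max]

def extract_stat_features(data):
    # data: (n_samples, 3) matrix of x, y, z values
    return np.concatenate([func(data, axis=0) for func in stat_funcs])

# Hypothetical usage: 100 tri-axial samples split into four
# non-overlapping windows of 25 samples each.
data = np.random.randn(100, 3)
windows = data.reshape(-1, 25, 3)
features = np.stack([extract_stat_features(w) for w in windows])
# features has shape (4, 15): 5 statistics x 3 axes per window
```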
Several pre-processing techniques are often applied to accelerometer data. For example, it is common to separate the 'body' and 'gravity' components from each other, to compute the magnitude of the signal, etc.
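One common way to separate the components is with a low-pass filter; the sketch below uses a Butterworth filter via SciPy, with an illustrative cutoff and order rather than values mandated by this repository:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def separate_gravity(data, fs=50.0, cutoff=0.3):
    # Estimate the slowly-varying gravity component with a low-pass
    # Butterworth filter, and take the residual as the body component.
    # The order (3) and cutoff (0.3 Hz) are illustrative choices.
    b, a = butter(3, cutoff, btype="low", fs=fs)
    gravity = filtfilt(b, a, data, axis=0)
    body = data - gravity
    return body, gravity

def magnitude(data):
    # Euclidean norm over the x, y, z axes of each sample
    return np.linalg.norm(data, axis=1)
```

For a stationary device the gravity component recovers the constant acceleration and the body component is close to zero.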
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── features <- The representation of the processed data.
│ ├── processed <- The intermediate data, transformed to the desired format.
│ └── raw <- The original, immutable data dump. All datasets have unique
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
└── tox.ini <- tox file with settings for running tox; see tox.testrun.org