This repository aims to provide a unified interface to wearable-based Human Activity Recognition (HAR) datasets. The philosophy is to acquire many datasets from a wide variety of recording conditions and to translate these into a consistent data format in order to more easily address open questions on feature extraction/representation learning, meta/transfer learning and active learning, among other tasks. Ultimately, I aim to create a home for the easier understanding of the stability, strengths and weaknesses of the state-of-the-art in HAR.
It is good practice to use a virtual environment when working with this repository. I have recently been using miniconda as my Python management system; it works exactly like anaconda. The following commands create a new environment, activate it and install the requirements into that environment.
conda create python=3.7 --name har_datasets
conda activate har_datasets
conda install --file requirements.txt
These are packaged into the make environment command.
Several global variables are required for this library to work. I set these up with the dotenv library. This searches for a file called .env that should be found in the project root. It then loads environment variables called PROJECT_ROOT, ZIP_ROOT and BUILD_ROOT. In my system, these are set up roughly as follows.
export PROJECT_ROOT="/users/username/workspace/har_datasets"
export ZIP_ROOT="/users/username/workspace/har_datasets/data/zip"
export BUILD_ROOT="/users/username/workspace/har_datasets/data/build"
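As a minimal sketch of how these variables might be consumed, the snippet below reads the three names from the environment and fails early if any is missing. It uses only the standard library and assumes the variables have already been loaded (for example by dotenv). The helper name read_roots is hypothetical and is not part of this repository.

```python
import os
from pathlib import Path

# Variable names taken from the text above.
REQUIRED = ("PROJECT_ROOT", "ZIP_ROOT", "BUILD_ROOT")


def read_roots(environ=os.environ):
    """Return the required roots as Path objects (hypothetical helper)."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: Path(environ[name]) for name in REQUIRED}
```

Passing a plain dict instead of os.environ makes the check easy to try out without touching the real environment.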
The data from all datasets listed in this project are converted into one consistent format that consists of four key elements:
- the train/validation/test fold definition file;
- the label file;
- the data file; and
- an index file.
Note, the serialisation format used in this repository is that data are stored on a per-sample basis. This means that each of the files listed above will have the same number of rows.
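The per-sample constraint can be checked with a few lines. This is only a sketch: it assumes the four files have already been read into row sequences (how they are read is not covered here), and the function name check_aligned is hypothetical.

```python
def check_aligned(index, labels, data, folds):
    """Raise ValueError unless all four per-sample tables have the same length."""
    lengths = {
        "index": len(index),
        "labels": len(labels),
        "data": len(data),
        "folds": len(folds),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"row counts differ: {lengths}")
    return lengths["index"]


# Toy rows for illustration only: two samples in every table.
n_rows = check_aligned(
    index=[(1, 1, 0.0), (1, 1, 0.02)],
    labels=[("walk",), ("walk",)],
    data=[(0.1, 0.2, 9.8), (0.1, 0.3, 9.7)],
    folds=[(-1,), (0,)],
)
```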
The following columns are required for the index file:
subject, trial, time
subject defines a subject identifier, trial allows for different trials to be specified (eg it can distinguish data from subjects who perform a task several times), and time defines the time (absolute or relative). Subject and trial should be integers, but need not be contiguous. Although time can be considered unnecessary in many applications (especially if the recording was done in a controlled environment or following a script), it is added here to allow for the detection of missing data (missing time stamps) and time-of-day features (if time represents epoch time, for example).
This file must have three columns only.
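To illustrate how the time column allows missing data to be detected, the sketch below groups index rows by (subject, trial) and reports consecutive time stamps that are further apart than an expected sampling period. The function name find_gaps and the assumption of a known, regular sampling period are mine, not part of the repository.

```python
from collections import defaultdict


def find_gaps(index_rows, step, tol=1e-9):
    """Map (subject, trial) to the (previous, current) time pairs that exceed step."""
    by_recording = defaultdict(list)
    for subject, trial, time in index_rows:
        by_recording[(subject, trial)].append(time)

    gaps = {}
    for key, times in by_recording.items():
        times = sorted(times)
        found = [(a, b) for a, b in zip(times, times[1:]) if b - a > step + tol]
        if found:
            gaps[key] = found
    return gaps


# Toy index rows: subject 1 is missing the sample at time 1.0.
rows = [(1, 1, 0.0), (1, 1, 0.5), (1, 1, 1.5), (2, 1, 0.0), (2, 1, 0.5)]
gaps = find_gaps(rows, step=0.5)
```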
The following structure is required for the task files
label_vals
This file must have at least one column. In general, it is expected that the column will be a list of strings (where the string corresponds to the target). This is not a requirement, however, and the label values may be vector-valued. It is important that the correct model and evaluation criteria are associated with the task.
The data format is quite simple:
x, y, z
where x, y and z correspond to the axes of the wearable. By default different files are created for each modality (ie accelerometer, gyroscope and magnetometer) and for each location (eg wrist, waist). For example, if one accelerometer is on the wrist a file called accel-wrist will be created for it. There is no restriction on the number of columns in this file, but we expect that more often than not three columns will be present, one for each axis of the device.
This file must have at least one column.
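The naming pattern can be sketched as follows. The example assumes each sensor is described by a dict with placement and modality keys and that the file name is simply the modality and placement joined by a hyphen, as in accel-wrist. The helper names are hypothetical, and the repository may build these names differently.

```python
def data_file_name(modality, placement):
    """Join modality and placement with a hyphen, eg accel-wrist."""
    return f"{modality}-{placement}"


def data_file_names(sources):
    """Return one file name per source dict (placement and modality keys assumed)."""
    return [data_file_name(s["modality"], s["placement"]) for s in sources]


names = data_file_names([
    {"placement": "wrist", "modality": "accel"},
    {"placement": "waist", "modality": "accel"},
])
```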
Train and test folds are defined by the columns of this file:
fold_1
-1
-1
-1
0
0
0
1
1
1
The behaviour of these folds is based on scikit-learn's PredefinedSplit class. Additional folds can (if necessary) be defined by adding supplementary columns to this file. For example, for 10 repetitions of 10-fold cross validation the file would have 10 columns, each containing the fold identifiers for one repetition.
This file must have at least one column.
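The sketch below mirrors, in plain Python, how I understand PredefinedSplit to interpret such a column: rows marked -1 are never placed in a test set, and every other value names the test fold that the row belongs to. It is an illustration of the semantics rather than the code used by this repository, and the function name predefined_splits is hypothetical. In practice sklearn.model_selection.PredefinedSplit can be used directly.

```python
def predefined_splits(fold_column):
    """Yield (train_indices, test_indices) for each non-negative fold identifier."""
    fold_ids = sorted({f for f in fold_column if f != -1})
    for fold_id in fold_ids:
        test = [i for i, f in enumerate(fold_column) if f == fold_id]
        # Everything else, including rows marked -1, is used for training.
        train = [i for i, f in enumerate(fold_column) if f != fold_id]
        yield train, test


fold_1 = [-1, -1, -1, 0, 0, 0, 1, 1, 1]
for train, test in predefined_splits(fold_1):
    print("train:", train, "test:", test)
# train: [0, 1, 2, 6, 7, 8] test: [3, 4, 5]
# train: [0, 1, 2, 3, 4, 5] test: [6, 7, 8]
```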
Several special fold definitions are also supported. LOSO performs leave one subject out cross validation, and deployable learns models on all of the data with the expectation that this model is to be deployed outside of the scope of the pipeline that created it.
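A rough sketch of what these two special definitions imply is given below. It assumes the subject column of the index file is available as a list; the function names are hypothetical and the repository's own implementation may differ.

```python
def loso_splits(subjects):
    """Yield (held_out_subject, train_indices, test_indices), one per subject."""
    for held_out in sorted(set(subjects)):
        test = [i for i, s in enumerate(subjects) if s == held_out]
        train = [i for i, s in enumerate(subjects) if s != held_out]
        yield held_out, train, test


def deployable_split(n_rows):
    """Use every row for training; there is no held-out test set."""
    return list(range(n_rows)), []


# Subject identifiers are integers but need not be contiguous.
subjects = [1, 1, 2, 2, 5]
loso = list(loso_splits(subjects))
```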
I hope to receive pull requests for new datasets, processing methods, features, and models to this repository. Requests are likely to be accepted once the exact data format, feature extraction, modelling and evaluation interfaces are relatively stable.
- Create a new yaml file in the metadata/datasets directory and fill out the information as accurately as possible. Follow the style and detail given in the entries named anguita2013, pamap2 and uschad. The entry of accurate metadata will be strictly moderated before a submission is accepted. Note:
  - The name of the file and the name field in the yaml file must be lower case.
  - List all sensor modalities in the dataset in the modalities field. The modality names should be consistent with the values found in metadata/modality.yaml.
  - List all sensor placements in the dataset in the placements field. The placement names should be consistent with the values found in metadata/placement.yaml.
  - List all outputs in the dataset in the sources field. For example, if a data source arrives from an accelerometer placed on the wrist, add a dict entry like {"placement": "wrist", "modality": "accel"}. This can be tedious, but there is great value in doing this.
  - If the dataset introduces a new task, add a new file called metadata/tasks/<task-name>.yaml. List all new target names in this file (see metadata/tasks/har.yaml for an example).
  - If the dataset introduces a new target to an existing task, add it to the end of tasks/<task-name>.yaml.
  - If the sensor has been placed on a new location, add it to the end of metadata/placement.yaml.
  - If the sensor is of a new modality, add it to the end of metadata/modality.yaml.
- Run make table. This will update the dataset table in the tables directory. Ensure this command executes successfully and verify that the entered information is accurate.
- Run make data. This will download the archive automatically based on the URLs provided in the download_urls field from the first step above.
- Copy the file src/datasets/__new__.py to src/datasets/<dataset-name>.py (<dataset-name> is defined by the first step above). The purpose of this file is to translate the data to the expected format described in the sections above. In particular, separate files with the wearable data, annotated labels, pre-defined folds, and index files are required. Use the existing datasets (anguita2013, pamap2 and uschad) that can be found in src/datasets as examples of how this has been achieved.
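Some of the metadata rules above can be checked mechanically before opening a pull request. The sketch below works on a plain dict that stands in for the parsed yaml file. The field names come from the text, but the function name check_dataset_metadata is hypothetical, and two of the checks are my own assumptions rather than documented rules: that every source must use a listed modality and placement, and that download_urls must be non-empty. The URL in the example is a placeholder.

```python
def check_dataset_metadata(meta):
    """Return a list of problems found in a dataset metadata dict."""
    problems = []

    name = meta.get("name", "")
    if not name or name != name.lower():
        problems.append("name must be present and lower case")

    modalities = set(meta.get("modalities", []))
    placements = set(meta.get("placements", []))
    for source in meta.get("sources", []):
        # Assumption: each source should refer to a listed modality and placement.
        if source.get("modality") not in modalities:
            problems.append(f"source modality not listed: {source}")
        if source.get("placement") not in placements:
            problems.append(f"source placement not listed: {source}")

    # Assumption: at least one URL is needed for the download step.
    if not meta.get("download_urls"):
        problems.append("download_urls is empty")
    return problems


example = {
    "name": "mydataset",
    "modalities": ["accel"],
    "placements": ["wrist"],
    "sources": [{"placement": "wrist", "modality": "accel"}],
    "download_urls": ["https://example.com/archive.zip"],  # placeholder URL
}
problems = check_dataset_metadata(example)
```

An empty list means none of these particular checks failed; it does not replace the manual review of the metadata.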
(Under construction. See examples/basic_har.py for basic examples.)
(Under construction. See src/models/sklearn/basic.py for basic examples.)
The following table enumerates the datasets that are under consideration for inclusion in this repository.
This project follows the Cookiecutter Data Science template with the aim of facilitating reproducible models and results. The majority of commands are executed with the make command, and we also provide a high-level data loading interface.