This repository aims to provide a unified interface to datasets for the task of accelerometer-based Human Activity Recognition (HAR). The philosophy is to catalogue as many datasets as possible from a wide variety of recording conditions (in terms of data format, feature extraction, label space, sampling frequency, device location/orientation, etc.) for the purpose of understanding the efficacy of transfer learning, online learning, lifelong learning, data representation, and feature extraction across a large collection of datasets.
This project follows the Data Science Cookiecutter template with the aim of facilitating reproducible models and results. The majority of commands are executed with the `make` command, and we also provide a high-level data loading interface.
All data will be converted to a simple CSV format with the following columns:

```
time, subject_id, sequence_id, activity_labels, fold_id, x, y, z
```
where `time` is in seconds and of type double, `subject_id` is an integer identifier of the subject, `sequence_id` identifies contiguous activities (one subject may therefore perform a task several times), `x`, `y`, and `z` are the x-, y-, and z-axis data (whether acceleration, magnetometer, or gyroscope), and `activity_labels` are the labels of the dataset. Finally, `fold_id` is an identifier that specifies the fold in which the data should appear; negative values will only be used in training, consistent with scikit-learn's `PredefinedSplit` module.
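The `fold_id` semantics can be sketched directly with scikit-learn; the fold values below are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold_id column: rows with -1 are always kept in training,
# while non-negative values index the test fold a row belongs to.
fold_id = np.array([-1, -1, 0, 0, 1, 1])

split = PredefinedSplit(test_fold=fold_id)
for train_idx, test_idx in split.split():
    print(train_idx, test_idx)
```

Each iteration yields one train/test partition; the two `-1` rows appear in every training set.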
We have made the decision to keep our data format relatively simple since we hope it will provide a language-agnostic interface to the data, so that users of, for example, R, MATLAB, Python, or C++ can use the data once it has been built. For datasets with several views into movement (e.g. with volunteers wearing several devices, or with IMUs providing not only acceleration data but also gyroscope and magnetometer data), we have decided that each 'view' be contained in a separate file, since in some cases the data are sampled at different rates. However, the views may be merged together using the `subject_id`, `time`, and `sequence_id` fields of the format above.
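Merging two views on those shared fields can be sketched with pandas; the frames below stand in for two hypothetical view files (all values are illustrative):

```python
import pandas as pd

# One accelerometer view and one gyroscope view of the same recording,
# both following the CSV format described above.
acc = pd.DataFrame({
    "time": [0.00, 0.02], "subject_id": [1, 1], "sequence_id": [0, 0],
    "activity_labels": ["walk", "walk"], "fold_id": [0, 0],
    "x": [0.10, 0.20], "y": [0.00, 0.10], "z": [9.80, 9.70],
})
gyr = pd.DataFrame({
    "time": [0.00, 0.02], "subject_id": [1, 1], "sequence_id": [0, 0],
    "x": [0.01, 0.02], "y": [0.00, 0.01], "z": [0.03, 0.02],
})

# Align the two views on the shared identifier fields.
merged = acc.merge(gyr, on=["subject_id", "time", "sequence_id"],
                   suffixes=("_acc", "_gyr"))
```

The `suffixes` argument disambiguates the overlapping `x`, `y`, and `z` columns into `x_acc`/`x_gyr` and so on.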
The following table enumerates the datasets accounted for in this repository, sorted by the surname of the first author of the paper.
We will gladly accept contributions to this repository in any form, but particularly we welcome additional datasets, new feature extraction processes, view representations, and bug fixes.
New datasets can be added by submitting a new issue (preferred) or by contacting me via email. Simply provide the information required to populate a new row in the table above. If you have a transformer that converts the data to the preferred format, please attach this too; if not, I will attempt to write a converter for the data, but this may take some time.
Two steps must be performed for a Pull Request to be accepted: 1. update the table above; and 2. add the transformer to the repository. These steps are outlined in more detail below:
The table above can be updated by adding a row with the following information:
| AuthorName | DatasetName | [PaperName](PaperURL) | [Description](DescriptionURL) | [Download](DownloadURL) | PublicationYear | SamplingFrequency | HasAccelerometer | HasGyroscope | HasMagnetometer | NumSubjects | NumActivities | Notes |
Please insert the new row alphabetically based on the first author's surname, and then by publication date in the event of a tie. Note that the name of the dataset is immutable and will be changed only in exceptional circumstances.
A new data transformer should be placed in `src/converters/<DatasetName>.py`, where `<DatasetName>` matches the second element of the newly inserted row. This file must provide a function called `<DatasetName>` which accepts the path to the raw data as an argument and returns pandas DataFrames. Using the `spherechallenge` dataset as an example, the file `src/converters/spherechallenge.py` will contain the following:
```python
def spherechallenge(input_path):
    data = load_sphere_challenge_data(input_path)
    return data
```
It is important that there is consistency between the name of the dataset in the table above, the name of the file in the `src` directory, and the name of the function, since the module importer reads the dataset information from the table and dynamically loads the transformation functions. In other words, the function must be importable as follows:
```python
from spherechallenge import spherechallenge
```
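The dynamic loading described above can be sketched with `importlib`; the helper name `load_transformer` is ours for illustration and not part of the repository:

```python
import importlib

def load_transformer(dataset_name):
    # Import the module named after the dataset and return the
    # transformer function of the same name, per the convention above.
    module = importlib.import_module(dataset_name)
    return getattr(module, dataset_name)
```

Given the `spherechallenge` example, `load_transformer("spherechallenge")` would return the `spherechallenge` function.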
We have implemented several feature extraction processes in the `src/features` directory, together with interfaces to map these features to the above datasets. New features should be relatively straightforward to add since they will typically operate on a matrix of acceleration data and return a vector. As a simple example, one may extract the `mean`, `standard deviation`, `range`, `min`, and `max` values as follows:
```python
import numpy as np

stat_funcs = [np.mean, np.std, np.ptp, np.min, np.max]

def extract_stat_features(data):
    return np.concatenate([func(data, axis=0) for func in stat_funcs])
```
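In practice such a feature extractor would typically be applied per window of samples rather than to a whole recording. A minimal sketch (the window length and data are illustrative, and the definitions are repeated so the snippet is self-contained):

```python
import numpy as np

stat_funcs = [np.mean, np.std, np.ptp, np.min, np.max]

def extract_stat_features(data):
    # data: (n_samples, 3) matrix of x, y, z values
    return np.concatenate([func(data, axis=0) for func in stat_funcs])

# Hypothetical usage: 100 tri-axial samples split into four
# non-overlapping windows of 25 samples each.
data = np.random.randn(100, 3)
windows = data.reshape(-1, 25, 3)
features = np.stack([extract_stat_features(w) for w in windows])
# features has shape (4, 15): 5 statistics x 3 axes per window
```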
Several pre-processing techniques are often applied to accelerometer data. For example, it is common to separate the 'body' and 'gravity' components from each other, to compute the magnitude of the signal, etc.
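One common way to separate the components is with a low-pass filter; the sketch below uses a Butterworth filter via SciPy, with an illustrative cutoff and order rather than values mandated by this repository:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def separate_gravity(data, fs=50.0, cutoff=0.3):
    # Estimate the slowly-varying gravity component with a low-pass
    # Butterworth filter, and take the residual as the body component.
    # The order (3) and cutoff (0.3 Hz) are illustrative choices.
    b, a = butter(3, cutoff, btype="low", fs=fs)
    gravity = filtfilt(b, a, data, axis=0)
    body = data - gravity
    return body, gravity

def magnitude(data):
    # Euclidean norm over the x, y, z axes of each sample
    return np.linalg.norm(data, axis=1)
```

For a stationary device the gravity component recovers the constant acceleration and the body component is close to zero.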
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── features <- The representation of the processed data.
│ ├── processed <- The intermediate data, transformed to the desired format.
│ └── raw <- The original, immutable data dump. All datasets have unique
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
└── tox.ini <- tox file with settings for running tox; see tox.testrun.org