This repository contains the processing code for producing tidy datasets from the raw dataset at http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones.
In order to produce the tidy datasets, I do the following:
1. Read the measurement features from the file `features.txt`, recognizing the ordered index and name of each measurement.
2. Subset the recognized measurements to those whose names contain either the text `mean` or the text `std`, which signify that they are either an arithmetic mean or a standard deviation of the corresponding series.
3. Read the descriptive textual activity labels from the file `activity_labels.txt`.
4. For the `test` dataset, read in the subject column data for all the test observations from the file `subject_*.txt`.
5. From the `test` dataset, read in the measurement features from the file `X_*.txt`. The columns of measurement values correspond, in order, to the features read in step 1, before the subsetting. The measurement values are then projected onto only the features left after the subsetting in step 2.
6. The activities associated with all the records are read in from the file `y_*.txt` for the `test` dataset, and the numerical activity codes are replaced with the textual labels acquired in step 3.
7. The columns from steps 4, 5, and 6 are then combined into one table where the rows are the observations and the columns are the variables.
8. Steps 4 to 7 are repeated for the `train` dataset.
9. The outcome datasets from steps 7 and 8 are combined vertically (the rows from the latter are appended to the rows from the former), and this is our final tidy dataset: `tidy_data.csv`.
10. The separate dataset `tidy_data-averaged.csv` is produced by further modifying the dataset from step 9 by computing the arithmetic mean of each measurement variable across all observations, while grouping by the subject and activity.
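
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repository's actual processing code: the function names (`select_features`, `parse_labels`, `build_table`, `average_by_group`, `write_csv`) are hypothetical, the input files are assumed to be whitespace-separated text, the `mean`/`std` match is assumed to be case-sensitive, and the tiny inline strings are made-up toy data standing in for the real files.

```python
import csv
from collections import defaultdict
from statistics import mean


def select_features(features_text):
    """Steps 1-2: parse 'index name' lines, keep names containing 'mean' or 'std'.

    Returns (zero_based_column, name) pairs in file order.
    """
    selected = []
    for line in features_text.strip().splitlines():
        index, name = line.split()
        if "mean" in name or "std" in name:  # assumption: case-sensitive match
            selected.append((int(index) - 1, name))
    return selected


def parse_labels(labels_text):
    """Step 3: map each numeric activity code (as text) to its label."""
    return dict(line.split() for line in labels_text.strip().splitlines())


def build_table(subject_text, x_text, y_text, selected, labels):
    """Steps 4-7: one row per observation with subject, activity, measurements."""
    rows = []
    lines = zip(subject_text.strip().splitlines(),
                x_text.strip().splitlines(),
                y_text.strip().splitlines())
    for subject, x_line, y in lines:
        values = [float(v) for v in x_line.split()]
        row = {"subject": int(subject), "activity": labels[y.strip()]}
        for column, name in selected:
            row[name] = values[column]  # project onto the selected features
        rows.append(row)
    return rows


def average_by_group(rows):
    """Step 10: mean of every measurement per (subject, activity) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["subject"], row["activity"])].append(row)
    averaged = []
    for (subject, activity), members in sorted(groups.items()):
        out = {"subject": subject, "activity": activity}
        for name in members[0]:
            if name not in ("subject", "activity"):
                out[name] = mean(m[name] for m in members)
        averaged.append(out)
    return averaged


def write_csv(rows, path):
    """Write a list of row dictionaries to a CSV file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


# Toy stand-ins for the real files (made-up values, for illustration only).
features = "1 tBodyAcc-mean()-X\n2 tBodyAcc-std()-X\n3 tBodyAcc-max()-X\n"
activity_labels = "1 WALKING\n2 SITTING\n"

selected = select_features(features)
labels = parse_labels(activity_labels)

test_rows = build_table("1\n1\n", "0.25 0.5 9.0\n0.75 1.5 7.0\n", "1\n1\n",
                        selected, labels)
train_rows = build_table("2\n", "1.0 2.0 3.0\n", "2\n", selected, labels)

tidy = test_rows + train_rows        # step 9: append train rows to test rows
averaged = average_by_group(tidy)    # step 10: per subject and activity
```

With the real files in place, calling `write_csv(tidy, "tidy_data.csv")` and `write_csv(averaged, "tidy_data-averaged.csv")` would produce the two outputs named above.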