Course Project for Coursera Getting and Cleaning Data 2014 course
This repository consists of two other files:
- run_analysis.R contains the R code which downloads, filters, and tidies up the data
- uci_har_tidy.txt contains the data output by run_analysis.R after it has been filtered and tidied
The raw data comes from the UCI Machine Learning Repository and is derived from recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone (Samsung Galaxy S II) with embedded inertial sensors.
Because the raw input data file is very large, it isn't included in the repo.
Instead, run_analysis.R includes instructions that first download the data and description file set and then extract the archive.
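The download-and-extract step could look roughly like the sketch below. It is an illustration only, not the code in run_analysis.R: the helper name fetch_archive and its parameters are hypothetical, and url stands for the archive's download address, which this README does not state.

```r
# Hypothetical helper (an assumption, not taken from run_analysis.R).
fetch_archive <- function(url, zip_path, exdir = ".") {
  # Skip the download when the archive is already on disk.
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary zip intact on Windows.
    download.file(url, destfile = zip_path, mode = "wb")
  }
  # Extract the archive; unzip() returns the paths of the extracted files.
  unzip(zip_path, exdir = exdir)
}
```

Checking for an existing archive first means that re-running the script does not fetch the large input file again.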
The rest of the run_analysis.R script performs the following operations on the files contained in UCI's archive:
- To merge the training and test data sets to create one data set:
  - It reads in the training data set from X_train.txt, which consists of measurements, and appends an activity column consisting of the training labels from the file y_train.txt and a subject.ID column consisting of the ID for the subject (volunteer) from the file subject_train.txt
  - It does the same thing for the testing data set
  - It combines the training and testing data sets by row
- To extract only the measurements on the mean and standard deviation for each signal:
  - It reads in the names of the features from features.txt, verifying that the features are listed in the same order as the columns in X_train.txt and X_test.txt
  - It figures out the relevant features by looking for mean() or std() in the variable names. Note that we only look for means that have been computed generally, rather than means computed for a specific purpose (e.g. the meanFreq feature) or means obtained by averaging the signals in a signal window sample (the angle() variable columns).
- To give descriptive activity names to the activities in the data set:
  - It reads in descriptions of the activities
  - It converts the activities in the data set, coded as a number from 1-6, into a factor with descriptive labels from the file activity_labels.txt. The mapping is as follows: 1 WALKING, 2 WALKING_UPSTAIRS, 3 WALKING_DOWNSTAIRS, 4 SITTING, 5 STANDING, 6 LAYING
- To label the columns in the data set with descriptive variable names:
  - It first corrects the typo "BodyBody" to "Body". We assume it's a typo, based on descriptions in the README.txt file in the archive and the fact that the feature names stay distinct after the correction.
  - It then replaces the abbreviations with more descriptive names, except for the standard deviation, for which it assigns the most common industry-accepted abbreviation, "SD".
- To create a second, independent tidy data set with the average of each variable for each activity and each subject:
  - It groups the data by subject ID and by activity
  - It summarizes the data in the groups by averaging
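The steps above can be sketched in base R as follows. This is a minimal illustration, not the code in run_analysis.R: it uses a tiny in-memory stand-in instead of reading the archive files, the numeric values are made up, the helper make_set is hypothetical, and both the exact renaming patterns and the use of aggregate() are assumptions.

```r
# Toy stand-ins for features.txt and activity_labels.txt.
features <- c("tBodyAcc-mean()-X", "fBodyBodyGyroMag-std()",
              "fBodyGyro-meanFreq()-X", "angle(tBodyAccMean,gravity)")
activity_labels <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                     "SITTING", "STANDING", "LAYING")

# Hypothetical helper: append the activity and subject.ID columns
# to a block of measurements.
make_set <- function(x, y, subject) {
  cbind(as.data.frame(x), activity = y, subject.ID = subject)
}

# Made-up measurements standing in for X_train.txt and X_test.txt.
train <- make_set(matrix(1:8, nrow = 2, byrow = TRUE),
                  y = c(1, 4), subject = c(1, 1))
test <- make_set(matrix(9:16, nrow = 2, byrow = TRUE),
                 y = c(1, 4), subject = c(1, 2))

# Step 1: combine the training and testing sets by row.
merged <- rbind(train, test)
names(merged)[seq_along(features)] <- features

# Step 2: keep only mean() and std() features. Matching the literal
# strings excludes meanFreq() and the angle() columns.
keep <- grepl("mean()", features, fixed = TRUE) |
  grepl("std()", features, fixed = TRUE)
selected <- merged[, c(features[keep], "activity", "subject.ID")]

# Step 3: turn the activity codes 1-6 into a labelled factor.
selected$activity <- factor(selected$activity, levels = 1:6,
                            labels = activity_labels)

# Step 4: fix the "BodyBody" typo and use "SD" for standard deviation
# (only these two renames are shown; the patterns are assumptions).
names(selected) <- gsub("BodyBody", "Body", names(selected), fixed = TRUE)
names(selected) <- gsub("std()", "SD", names(selected), fixed = TRUE)

# Step 5: average every measurement per subject and activity.
measure_cols <- setdiff(names(selected), c("activity", "subject.ID"))
tidy <- aggregate(selected[measure_cols],
                  by = list(subject.ID = selected$subject.ID,
                            activity = selected$activity),
                  FUN = mean)
```

With the real archive, data frames read via read.table() from X_train.txt, y_train.txt, subject_train.txt and their test counterparts would take the place of the toy matrices and vectors.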
The summarized tidy data set is written to the file uci_har_tidy.txt as a simple space-separated text file, with the first row serving as a header with the column names.
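A file in this format can be written and read back with base R, as in the sketch below. The data frame is a made-up stand-in (the column name tBodyAccMeanX and the values are hypothetical), the file goes to a temporary directory rather than the repository, and row.names = FALSE is an assumption about how the real file was produced.

```r
# Made-up stand-in for the summarized data set.
tidy <- data.frame(subject.ID = c(1, 2),
                   activity = c("WALKING", "SITTING"),
                   tBodyAccMeanX = c(0.25, 0.5))

# write.table() separates fields with a single space by default and
# writes the column names as the first row.
out_file <- file.path(tempdir(), "uci_har_tidy.txt")
write.table(tidy, out_file, row.names = FALSE)

# header = TRUE tells read.table() to treat the first row as column names.
reloaded <- read.table(out_file, header = TRUE)
```

Dropping the row names keeps the header row the same length as the data rows, so the file reads back with the original column names.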