project: Getting and Cleaning Data Course Project
author: Dmitry B. Grekov
date: Monday, February 16, 2015
The Run_analysis.R script creates a tidy dataset that meets the requirements of the "Coursera: Getting and Cleaning Data" Course Project assignment.
The raw source data is available at the following link:
https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
A full description is available at the site where the data was obtained:
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
The script assumes that the source zip file has already been downloaded and extracted into the current folder. The source data is stored in several separate files; the script uses the following ones:
- training dataset (UCI HAR Dataset/train)
  - X_train.txt - the training dataset itself
  - y_train.txt - activity codes for each measurement
  - subject_train.txt - codes of the subjects who performed the activity
- test dataset (UCI HAR Dataset/test)
  - X_test.txt - the test dataset
  - y_test.txt - activity codes
  - subject_test.txt - subject codes
- descriptive files (UCI HAR Dataset)
  - features.txt - measurement captions
  - activity_labels.txt - activity codes with their names
No other files are used by the script.
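Because the script relies on these files being present in the current folder, a quick check before running it can save time. The following is a minimal sketch, not part of the original script; `data.dir`, `required.files` and the helper `missing.files` are hypothetical names introduced for illustration.

```r
# Root of the extracted archive, relative to the working directory.
data.dir <- "UCI HAR Dataset"

# The eight files the script reads (see the list above).
required.files <- file.path(data.dir, c(
  "train/X_train.txt", "train/y_train.txt", "train/subject_train.txt",
  "test/X_test.txt", "test/y_test.txt", "test/subject_test.txt",
  "features.txt", "activity_labels.txt"
))

# Hypothetical helper: returns the paths that do not exist.
missing.files <- function(paths) paths[!file.exists(paths)]

missing <- missing.files(required.files)
if (length(missing) > 0) {
  message("Missing input files:\n", paste(missing, collapse = "\n"))
}
```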
The script contains the following steps:
- determine the list of the necessary columns
- build the subsets for training and test data separately
- merge the subsets into one single dataset
- aggregate the dataset
The script analyses the measurement captions from the features.txt file:
- filters them to leave only those containing the mean() or std() patterns
- makes the captions more readable by decoding abbreviations ("Acc", "Gyro", ...), removing brackets and using dots as separators between the parts of a variable name
- places the result into the features data.table with two columns:
  - id - sequential number of the column in the dataset
  - caption - readable column caption
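The filtering and renaming described above could look roughly like the sketch below. It is an illustration only: the sample lines stand in for features.txt, the expansions "Acceleration" and "Gyroscope" are assumed (the exact wording used by the script is not stated here), and a plain data.frame is used instead of a data.table to keep the example free of dependencies.

```r
# Sample lines in the two-column layout of features.txt (illustrative subset).
raw <- c("1 tBodyAcc-mean()-X", "4 tBodyAcc-std()-X", "7 tBodyAcc-mad()-X",
         "121 tBodyGyro-mean()-X", "294 fBodyAcc-meanFreq()-X")
features <- read.table(text = raw, col.names = c("id", "name"),
                       stringsAsFactors = FALSE)

# Keep only captions containing the literal patterns mean() or std().
# The escaped brackets make sure meanFreq() is not matched.
features <- features[grepl("mean\\(\\)|std\\(\\)", features$name), ]

# Make the captions readable.
caption <- gsub("()", "", features$name, fixed = TRUE)         # remove brackets
caption <- gsub("-", ".", caption, fixed = TRUE)               # dots as separators
caption <- gsub("Acc", "Acceleration", caption, fixed = TRUE)  # assumed expansion
caption <- gsub("Gyro", "Gyroscope", caption, fixed = TRUE)    # assumed expansion

features <- data.frame(id = features$id, caption = caption,
                       stringsAsFactors = FALSE)
print(features)
```

With these sample lines, only the three mean()/std() features remain, and `tBodyAcc-mean()-X` becomes `tBodyAcceleration.mean.X`.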
First, the script builds the training subset. It reads the X_train.txt file and keeps only the columns listed in the features$id vector. Then it adds the activity and subject codes from the y_train.txt and subject_train.txt files.
The test subset is built in the same way; only the file names differ (see the list of files above).
The subsets are stored in the x.train and x.test variables respectively.
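Since both subsets are built the same way, the step can be expressed as one function called twice. The sketch below is an assumption about how this might be written, not the script's actual code: `build.subset` is a hypothetical helper, and tiny stand-in files are created in a temporary folder so that the example runs without the real dataset.

```r
# Hypothetical helper: build one subset from its three files.
build.subset <- function(x.file, y.file, subject.file, features) {
  # Read all measurements, then keep only the selected columns.
  x <- read.table(x.file)[, features$id, drop = FALSE]
  names(x) <- features$caption
  # Append activity and subject codes (one value per measurement row).
  x$activity <- read.table(y.file)[[1]]
  x$subject  <- read.table(subject.file)[[1]]
  x
}

# Stand-in files with two measurements of three variables each.
tmp <- tempfile()
dir.create(tmp)
writeLines(c("0.1 0.2 0.3", "0.4 0.5 0.6"), file.path(tmp, "X_train.txt"))
writeLines(c("5", "3"), file.path(tmp, "y_train.txt"))
writeLines(c("1", "1"), file.path(tmp, "subject_train.txt"))

# Stand-in for the features table: keep columns 1 and 3.
features <- data.frame(id = c(1, 3), caption = c("first.mean", "third.std"),
                       stringsAsFactors = FALSE)

x.train <- build.subset(file.path(tmp, "X_train.txt"),
                        file.path(tmp, "y_train.txt"),
                        file.path(tmp, "subject_train.txt"),
                        features)
print(x.train)
```

Calling the same helper with the three test file names would produce x.test.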
At this stage the x.train and x.test subsets are merged into a single x.full dataset.
After merging, the activity codes are converted into a factor whose levels are taken from the activity_labels.txt file.
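A minimal sketch of the merge and the factor conversion, assuming the two subsets share the same columns. The small data frames and the column name `value` are made up for illustration, and only two of the activity labels are included.

```r
# Stand-ins for the two subsets (same columns in both).
x.train <- data.frame(value = c(0.1, 0.2), activity = c(1, 2), subject = c(1, 1))
x.test  <- data.frame(value = 0.3, activity = 2, subject = 2)

# Stack the rows of both subsets into one dataset.
x.full <- rbind(x.train, x.test)

# Two lines in the layout of activity_labels.txt: code, then name.
activity.labels <- read.table(text = c("1 WALKING", "2 WALKING_UPSTAIRS"),
                              col.names = c("code", "name"),
                              stringsAsFactors = FALSE)

# Replace numeric codes with a factor carrying the descriptive names.
x.full$activity <- factor(x.full$activity,
                          levels = activity.labels$code,
                          labels = activity.labels$name)
print(x.full)
```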
According to the assignment, the average of each variable in x.full should be calculated for each activity and each subject.
This is done by (1) melting the x.full dataset and (2) aggregating it with the dcast function.
This is not the most direct approach, but it works and is fast enough.
The result is assigned to the x.agg variable and saved to the tidy.txt file.
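The melt-then-dcast aggregation can be sketched as follows, assuming the data.table package is installed and that its melt and dcast methods are the ones in use. The sample data, the single measurement column and the write.table arguments are assumptions made for illustration; the output goes to a temporary folder here so the example does not overwrite a real tidy.txt.

```r
library(data.table)

# Stand-in for x.full: two id columns and one measurement column.
x.full <- data.table(
  subject  = c(1, 1, 2, 2),
  activity = c("WALKING", "WALKING", "WALKING", "SITTING"),
  tBodyAcceleration.mean.X = c(0.2, 0.4, 0.1, 0.3)
)

# (1) Melt: one row per (subject, activity, variable, value).
x.long <- melt(x.full, id.vars = c("subject", "activity"))

# (2) Cast back to wide form, averaging the values that fall into the
# same subject/activity/variable cell.
x.agg <- dcast(x.long, subject + activity ~ variable, fun.aggregate = mean)

# Save the result as a plain text table.
write.table(x.agg, file.path(tempdir(), "tidy.txt"), row.names = FALSE)
print(x.agg)
```

Each subject/activity pair ends up as one row, so the four sample measurements collapse into three rows.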