README

project: Getting and Cleaning Data Course Project

author: Dmitry B. Grekov

date: Monday, February 16, 2015

Summary

The Run_analysis.R script creates a tidy dataset according to the requirements of the "Coursera: Getting and Cleaning Data" Course Project assignment.

The raw source data is available at the following link:

https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip

A full description is available at the site where the data was obtained:

http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

The script assumes that the source zip file has already been downloaded and extracted into the current folder. The source data is stored in several separate files; the script uses the following ones:

  • training dataset (UCI HAR Dataset/train)
    • X_train.txt - the training dataset itself
    • y_train.txt - activity codes for each measurement
    • subject_train.txt - codes of the subjects who performed the activity
  • test dataset (UCI HAR Dataset/test)
    • X_test.txt - the test dataset
    • y_test.txt - activity codes
    • subject_test.txt - subject codes
  • descriptive files (UCI HAR Dataset)
    • features.txt - measurement captions
    • activity_labels.txt - activity codes with their names

No other files are used by the script.

The script performs the following steps:

  1. determine the list of the necessary columns
  2. build the subsets for training and test data separately
  3. merge the subsets into one single dataset
  4. aggregate the dataset

Building the list of necessary columns

The script analyses the measurement captions from the features.txt file. It:

  • filters them to leave only those containing the mean() or std() patterns
  • makes the captions more readable by decoding abbreviations ("Acc", "Gyro", ...), removing brackets and using dots as separators between variable parts
  • places the result into the features data.table with two columns:
    • id - sequential number of the column in the dataset
    • caption - readable column caption
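The column-selection step can be sketched roughly as follows. This is an illustrative sketch, not the script's exact code; it assumes features.txt sits under UCI HAR Dataset/ as described above:

```r
# Read the measurement captions: column 1 = id, column 2 = raw caption
features <- read.table("UCI HAR Dataset/features.txt",
                       col.names = c("id", "caption"),
                       stringsAsFactors = FALSE)

# Keep only the mean() and std() measurements
features <- features[grepl("mean\\(\\)|std\\(\\)", features$caption), ]

# Make captions readable: decode abbreviations, drop brackets,
# and use dots to separate the variable parts
features$caption <- gsub("Acc", ".Accelerometer", features$caption)
features$caption <- gsub("Gyro", ".Gyroscope", features$caption)
features$caption <- gsub("\\(\\)", "", features$caption)
features$caption <- gsub("-", ".", features$caption)
```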

Building train and test subsets

First, the script builds the train subset. It reads the X_train.txt file and keeps only the columns listed in the features$id vector. Then it adds the activity and subject codes from the y_train.txt and subject_train.txt files.

The test subset is built the same way; only the file names differ (see the 'Summary' section).

The subsets are stored in the x.train and x.test variables respectively.
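As a rough illustration, building the train subset could look like this (a hedged sketch under the file layout listed in the 'Summary' section, not the author's exact code):

```r
# Read the full training data and keep only the selected columns
x.train <- read.table("UCI HAR Dataset/train/X_train.txt")[, features$id]
names(x.train) <- features$caption

# Attach activity and subject codes as extra columns
x.train$activity <- read.table("UCI HAR Dataset/train/y_train.txt")[, 1]
x.train$subject  <- read.table("UCI HAR Dataset/train/subject_train.txt")[, 1]
```

The test subset follows the same pattern with the corresponding test file names.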

Merging the subsets

At this stage the x.train and x.test subsets are merged into a single x.full dataset. After merging, the activity codes are coerced into a factor whose levels are taken from the activity_labels.txt file.
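A minimal sketch of this step (assuming the x.train and x.test variables from the previous step):

```r
# Stack the two subsets into one dataset
x.full <- rbind(x.train, x.test)

# Turn activity codes into a labelled factor
activities <- read.table("UCI HAR Dataset/activity_labels.txt",
                         col.names = c("id", "name"))
x.full$activity <- factor(x.full$activity,
                          levels = activities$id,
                          labels = activities$name)
```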

Aggregating the dataset

According to the assignment, the average of each variable in x.full should be calculated for each activity and each subject. This is done by (1) melting the x.full dataset and (2) aggregating it with the dcast function. Not the most direct route, but it works correctly and is fast enough.

The result is stored in the x.agg variable and saved to the tidy.txt file.
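The melt/dcast step described above can be sketched like this (assuming the reshape2 package, which provides melt and dcast for data frames; variable names as used earlier in this README):

```r
library(reshape2)

# Melt to long form, then average each variable per (activity, subject)
molten <- melt(x.full, id.vars = c("activity", "subject"))
x.agg  <- dcast(molten, activity + subject ~ variable, mean)

# Save the tidy result
write.table(x.agg, "tidy.txt", row.names = FALSE)
```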