/CleanData-Project

Getting and Cleaning Data - Course Projects

The purpose of the project was to produce a tidy data set from the files provided. The commands used to produce the tidy set, together with comments, are stored in the "run_analysis.R" file.

Steps

The project consisted of the following steps: merge the training and the test sets to create one data set; extract only the measurements on the mean and standard deviation for each measurement; use descriptive activity names to name the activities in the data set; label the data set with descriptive activity names; and, lastly, create a second, independent tidy data set with the average of each variable for each activity and each subject.

Step 1

The required files were downloaded and unzipped from within R, using the functions:

- download.file
- unzip

Subsequent components of the future dataset (that is: training and test x's, y's and subjects) were read using functions:

- file.path (to build a path from several parts, some of which contain blank spaces)
- read.table (as the files were in .txt format)

One data frame was produced using functions:

- cbind (adds columns to existing object - used for binding x variables with subject and activity)
- rbind (adds rows to existing object - used for adding test data frame to the bottom of training data frame)

For the author, this step also involved naming columns with the names of features which were provided in a separate file. This was achieved using the following functions:

- as.character (which changed the type of label from factor to string)
- colnames (to add the desired attribute to data frame)
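The reading and binding in this step can be sketched on a toy directory. This is a minimal illustration, not the contents of "run_analysis.R": the directory layout, file contents, and the helper `read_set` are hypothetical, and the download is shown only as comments because the data URL is not given in this README.

```r
# Download and unzip would come first (not run here; data_url is a placeholder):
# download.file(data_url, zip_file, mode = "wb")
# unzip(zip_file)

# Toy stand-in for the unzipped data (hypothetical layout and values).
root <- file.path(tempdir(), "toy dataset")  # path part containing a blank space
for (set in c("train", "test")) {
  dir.create(file.path(root, set), recursive = TRUE, showWarnings = FALSE)
}
writeLines(c("1 featA", "2 featB"), file.path(root, "features.txt"))
writeLines(c("0.1 0.2", "0.3 0.4"), file.path(root, "train", "X_train.txt"))
writeLines(c("1", "2"), file.path(root, "train", "y_train.txt"))
writeLines(c("1", "1"), file.path(root, "train", "subject_train.txt"))
writeLines("0.5 0.6", file.path(root, "test", "X_test.txt"))
writeLines("2", file.path(root, "test", "y_test.txt"))
writeLines("2", file.path(root, "test", "subject_test.txt"))

# Feature names live in a separate file; as.character is a no-op when the
# column is already a string, but matters when it was read as a factor.
features <- read.table(file.path(root, "features.txt"))
feature_names <- as.character(features$V2)

# Hypothetical helper: read one set and bind x with subject and activity.
read_set <- function(set) {
  x <- read.table(file.path(root, set, paste0("X_", set, ".txt")))
  y <- read.table(file.path(root, set, paste0("y_", set, ".txt")))
  subject <- read.table(file.path(root, set, paste0("subject_", set, ".txt")))
  colnames(x) <- feature_names
  cbind(x, subject = subject$V1, activity = y$V1)
}

# Test rows go below the training rows.
merged <- rbind(read_set("train"), read_set("test"))
```

Naming the columns before calling rbind is a choice made for this sketch; it keeps the column names unique so that the two data frames line up unambiguously.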

Step 2

Choosing only the variables which were means or standard deviations of some characteristics was performed with the following functions:

- grepl (searching for a pattern of characters in a string)
- which (returning the indices that meet some condition)

Columns labeled with meanFreq() were also kept, as the file 'features_info.txt' indicated that this is a "weighted average of the frequency components to obtain a mean frequency", which simply means that it is also a summary statistic, but of a variable measuring some frequencies.

The resulting data set consists of 81 columns (79 variables plus columns indicating activity and subject).
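The column selection can be sketched as follows. The feature names are a small hypothetical sample in the style of the features file, and the pattern "mean|std" is an assumption about how the match might be written, not necessarily the one used in "run_analysis.R".

```r
# Hypothetical sample of feature names.
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                   "tBodyAcc-max()-X", "fBodyAcc-meanFreq()-X",
                   "angle(tBodyAccMean,gravity)")

# "mean" also matches "meanFreq"; the match is case-sensitive, so the
# capitalised "Mean" inside angle(...) is not selected.
keep <- which(grepl("mean|std", feature_names))
selected <- feature_names[keep]
```

With this pattern the sample yields the mean(), std() and meanFreq() columns and skips the max() and angle(...) ones.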

Step 3 and step 4

Labeling the activities with descriptive names rather than numbers consisted of two parts. The first was to read the file containing the activity names; the second was to convert the variable "activity" to a factor and name its levels with those names. This was achieved with functions:

- file.path and read.table (as described above, in step 1)
- factor(data, labels) (where data is the desired "activity" column and labels are names read from the file "activity_labels.txt")
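A minimal sketch of the relabeling, assuming the labels file has been read into a two-column data frame. The data frame `activity_labels` and the activity codes below are illustrative stand-ins, not values taken from the project files.

```r
# Stand-in for the table read from the activity labels file (illustrative).
activity_labels <- data.frame(
  V1 = 1:3,
  V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS")
)

# Numeric activity codes as they would appear in the merged data.
activity <- c(2, 1, 3, 1)

# Passing levels explicitly ties each code to its name even if the rows
# of the labels file are not in ascending order.
activity_named <- factor(activity,
                         levels = activity_labels$V1,
                         labels = activity_labels$V2)
```

Supplying `levels` in addition to `labels` is a defensive choice made here; with codes that run 1, 2, 3, ... in file order, factor(data, labels) alone gives the same result.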

Step 5

Creating the second table, with average of each variable for each activity and each subject, was performed with:

- melt (creating a table with 813621 rows, each containing subject number, activity name - as we want to summarize data according to these columns - and variable name and its value)
- dcast (creating a data frame with 180 rows, because of 30 subjects, each with 6 possible activity states, and 81 columns: 79 variables, subject number and activity name)

It is important to set the aggregation function to mean (in function dcast, argument fun.aggregate), because when several values fall into one cell the default is length, which merely counts the values and is not the desired summary function.
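The melt and dcast calls can be sketched on a toy data frame. This assumes the functions come from the reshape2 package, which this README does not name; the data and column names are hypothetical.

```r
library(reshape2)  # assumption: melt and dcast are taken from reshape2

# Toy data: two subjects, two activities, two measured variables.
toy <- data.frame(
  subject  = c(1, 1, 1, 2, 2, 2),
  activity = c("WALKING", "WALKING", "SITTING", "WALKING", "SITTING", "SITTING"),
  featA    = c(1, 3, 5, 2, 4, 6),
  featB    = c(10, 30, 50, 20, 40, 60)
)

# Long format: one row per (subject, activity, variable) observation.
long <- melt(toy, id.vars = c("subject", "activity"))

# Wide format again, averaging the values in each subject/activity cell.
tidy <- dcast(long, subject + activity ~ variable, fun.aggregate = mean)
```

Here `long` has one row per original value (6 rows times 2 variables), and `tidy` has one row per subject and activity combination, which mirrors the 30 x 6 = 180 rows of the real output.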

The tidy dataset is written to a separate .txt file using the following function:

- write.table (the "row.names" argument needs to be set to FALSE)
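Writing the result might look like the sketch below. The data frame and the output file name are hypothetical; the file is read back only to show that no row-name column was added.

```r
# Toy tidy table (hypothetical values).
tidy <- data.frame(
  subject  = c(1, 2),
  activity = c("WALKING", "SITTING"),
  featA    = c(0.5, 0.7)
)

out_file <- file.path(tempdir(), "tidy_data.txt")  # hypothetical file name

# row.names = FALSE keeps the running row index out of the file.
write.table(tidy, out_file, row.names = FALSE)

# Reading the file back recovers the same columns.
round_trip <- read.table(out_file, header = TRUE)
```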