Coursera "Getting and Cleaning Data" (Johns Hopkins) course
Course Project
Jon Ide
You should create one R script called run_analysis.R that does the following.
- Merges the training and the test sets to create one data set.
- Extracts only the measurements on the mean and standard deviation for each measurement.
- Uses descriptive activity names to name the activities in the data set.
- Appropriately labels the data set with descriptive variable names.
- From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.
- Note: It is assumed that the working directory in which run_analysis.R is run contains subfolders "test" and "train".
- Read 'features.txt' file into features data frame
- Read 'activity_labels.txt' file into activity_labels data frame
- Read the training data:
- Read the training data 'x_train.txt' file into data frame x_train
- Read the training data activity codes 'y_train.txt' file into data frame y_train
- Read the training data subject codes 'subject_train.txt' file into data frame subject_train
- Read the test data:
- Read the test data 'x_test.txt' file into data frame x_test
- Read the test data activity codes 'y_test.txt' file into data frame y_test
- Read the test data subject codes 'subject_test.txt' file into data frame subject_test
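
The reading steps above can be sketched as follows. This is a minimal illustration, not the contents of run_analysis.R: it assumes the files are whitespace-separated with no header row, and that 'features.txt' and 'activity_labels.txt' sit next to the "test" and "train" subfolders. To stay self-contained it first writes a tiny made-up data set to a temporary directory; the helper `read_split` and the variable `root` are hypothetical names introduced here.

```r
# Build a tiny stand-in for the data layout in a temporary directory.
# All values are made up for illustration.
root <- file.path(tempdir(), "har_demo")
dir.create(file.path(root, "train"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "test"), recursive = TRUE, showWarnings = FALSE)

writeLines(c("1 tBodyAcc-mean()-X", "2 tBodyAcc-std()-X", "3 tBodyAcc-meanFreq()-X"),
           file.path(root, "features.txt"))
writeLines(c("1 WALKING", "2 SITTING"), file.path(root, "activity_labels.txt"))
writeLines(c("0.1 0.2 0.3", "0.4 0.5 0.6"), file.path(root, "train", "x_train.txt"))
writeLines(c("1", "2"), file.path(root, "train", "y_train.txt"))
writeLines(c("1", "1"), file.path(root, "train", "subject_train.txt"))
writeLines("0.7 0.8 0.9", file.path(root, "test", "x_test.txt"))
writeLines("2", file.path(root, "test", "y_test.txt"))
writeLines("2", file.path(root, "test", "subject_test.txt"))

# Hypothetical helper: read the three files of one split ("train" or "test").
read_split <- function(root, split) {
  path <- function(prefix) file.path(root, split, paste0(prefix, "_", split, ".txt"))
  list(
    x = read.table(path("x")),
    y = read.table(path("y")),
    subject = read.table(path("subject"))
  )
}

# Lookup tables: column 1 is the numeric code, column 2 is the name.
features <- read.table(file.path(root, "features.txt"), stringsAsFactors = FALSE)
activity_labels <- read.table(file.path(root, "activity_labels.txt"), stringsAsFactors = FALSE)

train <- read_split(root, "train")
test <- read_split(root, "test")

# Unpack into the data frame names used in the steps above.
x_train <- train$x; y_train <- train$y; subject_train <- train$subject
x_test <- test$x; y_test <- test$y; subject_test <- test$subject
```

Because `read.table` is called without `header = TRUE`, the columns come back with default names such as `V1` and `V2`; the descriptive names are attached in later steps.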
- Use rbind to combine the training and testing measurement data into data frame x_combined
- Find the columns whose feature names contain "-mean()" or "-std()" and keep just those columns in data frame x_combined_means_sds
- Use rbind to combine the training and testing activity codes into data frame y_combined with column name "Activity"
- Use rbind to combine the training and testing subjects into data frame subject_combined with column name "Subject"
- Replace numeric activity codes in y_combined with their text equivalents
- Add a column "SubjAct" to combined_data (the data frame holding the Subject, Activity, and measurement columns together) that contains subject and activity concatenated with separator "-". This will be used by group_by.
- Using dplyr, create data frame grouped:
- Group by the SubjAct column using group_by
- Remove the Subject and Activity columns, using select
- Calculate the mean of each variable for each group, using summarize_each
- Add a Temp column with SubjAct split to retrieve the Subject and Activity
- Restore the Subject and Activity columns and remove the SubjAct and Temp columns
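
The merge, label, and grouping steps above can be sketched on toy data as follows. Several parts are assumptions rather than the script's actual code: the toy values are made up, `combined_data` is assumed to be a `cbind` of the three combined pieces, `match` is just one way to replace the activity codes, and `summarise(across(...))` (dplyr 1.0.0 or later) stands in for `summarize_each`, which is deprecated in recent dplyr releases. `Temp` is kept as a plain variable here instead of a column.

```r
library(dplyr)

# Toy stand-ins for the data frames read earlier; all values are made up.
features <- data.frame(
  V1 = 1:3,
  V2 = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-meanFreq()-X"),
  stringsAsFactors = FALSE
)
activity_labels <- data.frame(V1 = 1:2, V2 = c("WALKING", "SITTING"),
                              stringsAsFactors = FALSE)
x_train <- data.frame(V1 = c(0.1, 0.3), V2 = c(1, 2), V3 = c(9, 9))
x_test <- data.frame(V1 = c(0.5, 0.7), V2 = c(3, 4), V3 = c(9, 9))
y_train <- data.frame(V1 = c(1, 1))
y_test <- data.frame(V1 = c(1, 2))
subject_train <- data.frame(V1 = c(1, 1))
subject_test <- data.frame(V1 = c(2, 2))

# Merge training and test rows, then name the columns from the feature list.
x_combined <- rbind(x_train, x_test)
names(x_combined) <- features$V2

# fixed = TRUE matches the literal text, so "-meanFreq()" is not selected.
keep <- grepl("-mean()", features$V2, fixed = TRUE) |
  grepl("-std()", features$V2, fixed = TRUE)
x_combined_means_sds <- x_combined[, keep, drop = FALSE]

y_combined <- rbind(y_train, y_test)
names(y_combined) <- "Activity"
subject_combined <- rbind(subject_train, subject_test)
names(subject_combined) <- "Subject"

# Replace numeric activity codes with their text labels.
y_combined$Activity <- activity_labels$V2[match(y_combined$Activity, activity_labels$V1)]

# Assumption: combined_data is the column-wise combination of the three pieces.
combined_data <- cbind(subject_combined, y_combined, x_combined_means_sds)
combined_data$SubjAct <- paste(combined_data$Subject, combined_data$Activity, sep = "-")

# One row per subject-activity pair, holding the mean of every measurement.
grouped <- combined_data %>%
  group_by(SubjAct) %>%
  select(-Subject, -Activity) %>%
  summarise(across(everything(), mean)) %>%
  as.data.frame()

# Split "subject-activity" back into its two parts, then drop the helper column.
Temp <- strsplit(grouped$SubjAct, "-", fixed = TRUE)
grouped$Subject <- sapply(Temp, `[`, 1)
grouped$Activity <- sapply(Temp, `[`, 2)
grouped$SubjAct <- NULL
```

Splitting on "-" only works when neither the subject nor the activity label contains a hyphen. Grouping directly with `group_by(Subject, Activity)` would avoid the concatenate-and-split round trip, at the cost of departing from the steps described here.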
- Clean up the variable names
- Turn Subject into an integer so it sorts numerically rather than alphabetically, then sort the table on Subject using dplyr's arrange
- Turn Activity into a factor (not really necessary)
- Save the tidy data set in text file 'tidy.txt'
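
The final cleanup and save steps can be sketched as follows. The renaming rule (dropping "()" and replacing "-" with ".") is a hypothetical example, since the exact cleanup is not spelled out above; the toy table, the temporary output path `out_file`, and the choice of `row.names = FALSE` are likewise assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the summarized table; values are made up.
grouped <- data.frame(
  Subject = c("10", "2", "1"),
  Activity = c("WALKING", "SITTING", "WALKING"),
  stringsAsFactors = FALSE
)
grouped[["tBodyAcc-mean()-X"]] <- c(0.3, 0.2, 0.1)

# Hypothetical cleanup rule: drop "()" and turn "-" into ".".
names(grouped) <- gsub("-", ".", gsub("()", "", names(grouped), fixed = TRUE),
                       fixed = TRUE)

# As character strings "10" sorts before "2"; as integers it does not.
grouped$Subject <- as.integer(grouped$Subject)
tidy <- arrange(grouped, Subject)

# Optional: store Activity as a factor.
tidy$Activity <- factor(tidy$Activity)

# Written to a temporary path here so the sketch leaves no files behind;
# the script itself saves to 'tidy.txt'.
out_file <- file.path(tempdir(), "tidy.txt")
write.table(tidy, out_file, row.names = FALSE)
```

The saved file can be read back with `read.table(out_file, header = TRUE)` to check the result.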