GettingDataCourseProject

course project summarizing Samsung data for Getting & Cleaning Data course

Files needed in working directory

  • X_test.txt - contains data for test subjects
  • X_train.txt - contains data for training subjects
  • features.txt - variable names (column headers) for test and training data
  • subject_test.txt - subjects associated with each row of test data
  • y_test.txt - activities associated with each row of test data
  • subject_train.txt - subjects associated with each row of training data
  • y_train.txt - activities associated with each row of training data
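
Before running the script, it can help to confirm that all seven files are present. The following is a minimal sketch and is not part of run_analysis.R; the names required_files and missing_files are hypothetical and introduced only for illustration.

```r
# Files the list above names as required in the working directory.
required_files <- c("X_test.txt", "X_train.txt", "features.txt",
                    "subject_test.txt", "y_test.txt",
                    "subject_train.txt", "y_train.txt")

# Hypothetical helper: returns the required files that are absent from `dir`.
missing_files <- function(dir = getwd(), files = required_files) {
  files[!file.exists(file.path(dir, files))]
}

# Report anything missing from the current working directory.
missing <- missing_files()
if (length(missing) > 0) {
  message("Missing input files: ", paste(missing, collapse = ", "))
}
```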

Outline of steps in the run_analysis.R code

  • Reads in test and training data, adds column names from the features.txt file and combines the two data sets into one dataframe called projectdata.

  • The original data has 561 variables. The second step selects only those that report the mean and standard deviation of each measurement. Using the variable names from features.txt, now the column headers, this step keeps the variables whose names include "mean" or "std" and drops all others. Variables whose names include "meanFreq" or "Mean" are also dropped because they have no associated standard deviation variables. The resulting dataframe has 66 variables.

  • Next, the code puts the variables in alphabetical order by column name. The selection step above picks out all the "mean" variables first and then the "std" variables, so sorting the columns alphabetically puts each mean variable next to its associated std variable, closer to how they appear in the original datasets. This step makes the data tidier by keeping related variables close to each other.

  • The next several lines of code add variables for the subjects and activities. The activities, originally labelled 1 through 6, are converted to more helpful text descriptions (Walking, Standing, Sitting, etc.). The code also adds a variable for treatment (test or train); otherwise that information is lost when the test and train data are combined. If analyses are later run on this data set, it may be important to group the data by whether the subjects were in the test or training treatment. This variable helps keep the data tidy and informative.

  • The final step groups the rows by activity for each of the 30 subjects and calculates the mean of each of the 66 variables. This summary table is the final output of the code.
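
The steps above could be sketched roughly as follows in base R. This is not the contents of run_analysis.R. Small made-up data frames stand in for the input files so the example runs on its own. Apart from projectdata, every object name is hypothetical. The matching on the literal strings "mean()" and "std()", the spelling of the activity labels, and the use of grep and aggregate are all assumptions about one way to get the described result.

```r
# Toy stand-in for features.txt: two mean()/std() pairs plus two names
# that should be excluded (meanFreq and Mean).
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "tGravityAcc-mean()-X", "tGravityAcc-std()-X",
              "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)")

# Toy stand-ins for X_test.txt / X_train.txt (two rows each).
test_x  <- setNames(as.data.frame(matrix(1:12,  nrow = 2)), features)
train_x <- setNames(as.data.frame(matrix(13:24, nrow = 2)), features)

# Toy stand-ins for subject_*.txt and y_*.txt.
subject_test  <- c(1, 1); activity_test  <- c(1, 1)
subject_train <- c(2, 2); activity_train <- c(4, 5)

# Step 1: combine test and training rows into one data frame.
projectdata <- rbind(test_x, train_x)

# Step 2: keep mean() columns, then std() columns. Matching the literal
# "mean()" leaves out meanFreq() and Mean (assumed matching rule).
mean_cols <- grep("mean()", names(projectdata), fixed = TRUE)
std_cols  <- grep("std()",  names(projectdata), fixed = TRUE)
projectdata <- projectdata[, c(mean_cols, std_cols), drop = FALSE]

# Step 3: sort columns alphabetically so each mean/std pair is adjacent.
projectdata <- projectdata[, order(names(projectdata)), drop = FALSE]
measure_cols <- names(projectdata)

# Step 4: add subject, activity (codes 1-6 as text) and treatment.
# Label spellings are an assumption; codes follow the data set's 1-6 order.
activity_labels <- c("Walking", "Walking Upstairs", "Walking Downstairs",
                     "Sitting", "Standing", "Laying")
projectdata <- data.frame(
  subject   = c(subject_test, subject_train),
  activity  = factor(c(activity_test, activity_train),
                     levels = 1:6, labels = activity_labels),
  treatment = rep(c("test", "train"),
                  times = c(nrow(test_x), nrow(train_x))),
  projectdata,
  check.names = FALSE   # keep names such as "tBodyAcc-mean()-X" unchanged
)

# Step 5: mean of every measurement for each subject/activity pair.
summary_table <- aggregate(projectdata[measure_cols],
                           by = list(subject  = projectdata$subject,
                                     activity = projectdata$activity),
                           FUN = mean)
```

With the real data the same shape of code would produce one row per subject and activity combination and 66 measurement columns. In this toy version there are three rows and four measurement columns.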