Get Data - Course Project
This document walks step by step through transforming and summarizing triaxial acceleration and angular velocity measurements of typical body movements. The data, which was recorded by the motion sensors of a Samsung Galaxy S II, can be downloaded from the UCI Machine Learning Repository.
Dataset URL: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
The dataset provides:
- 10299 samples
- 561 measures for each sample
- 6 motion activity types
Step-by-step data cleanup
1. Read and merge test and training datasets.
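We loop over the two set names, 'test' and 'train', and perform the following operations on each. The helper functions extract_mean_std and name_activities are defined in steps 2 and 3 and must be loaded before this loop runs.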
sets <- c("test", "train")
# Feature names become the column names of the measurement tables
feature_col_names <- as.character(read.table("features.txt")$V2)
data <- NULL
for (set in sets) {
  # Measurements, e.g. test/X_test.txt
  x_filename <- paste("X_", set, ".txt", sep = "")
  x_data <- read.table(file.path(set, x_filename), col.names = feature_col_names)
  x_data <- extract_mean_std(x_data)
  # Activity codes, e.g. test/y_test.txt
  y_filename <- paste("y_", set, ".txt", sep = "")
  y_data <- read.table(file.path(set, y_filename), col.names = c("Activity"))
  y_data <- name_activities(y_data)
  # Subject ids, e.g. test/subject_test.txt
  subject_filename <- paste("subject_", set, ".txt", sep = "")
  subject_data <- read.table(file.path(set, subject_filename), col.names = c("Subject"))
  # Bind the columns for this set, then append its rows to the result
  data <- rbind(data, cbind(subject_data, y_data, x_data))
}
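One detail of the read step is easy to miss: read.table sanitizes the names passed through col.names (its check.names argument defaults to TRUE), so characters such as '-', '(' and ')' become dots. The sketch below illustrates this on a one-row inline table; the four feature names are sample values picked for illustration, and the numbers are made up.

```r
# Sample feature names in the raw style of features.txt (illustrative subset)
raw_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y",
               "fBodyAcc-meanFreq()-X", "tBodyAccMag-mean()")

# A one-row table read from inline text instead of a file (made-up values)
toy <- read.table(text = "0.1 0.2 0.3 0.4", col.names = raw_names)

# Invalid characters have been replaced by dots
clean_names <- names(toy)
print(clean_names)
```

This is why the later steps match and strip dots rather than parentheses and hyphens.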
2. Extract only mean and standard deviation columns.
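A regular expression keeps only the columns that hold mean and standard deviation values. Because read.table turns characters such as '(', ')' and '-' into dots, a name like tBodyAcc-mean()-X arrives as tBodyAcc.mean...X. The pattern therefore requires a dot right after 'mean' or 'std', which also leaves out the meanFreq columns.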
extract_mean_std <- function(df) {
  df[grep("(mean|std)\\.", colnames(df))]
}
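To see which columns the pattern keeps, here is a minimal sketch on a toy data frame. The column names are sample values in the sanitized style that read.table produces, and the numbers are made up.

```r
extract_mean_std <- function(df) {
  df[grep("(mean|std)\\.", colnames(df))]
}

# Toy one-row data frame with sanitized column names (made-up values)
toy <- data.frame(
  tBodyAcc.mean...X     = 0.1,
  tBodyAcc.std...Y      = 0.2,
  fBodyAcc.meanFreq...X = 0.3,
  tBodyAcc.max...X      = 0.4
)

# Only names with a dot directly after "mean" or "std" survive
kept <- colnames(extract_mean_std(toy))
print(kept)
```

The meanFreq column is dropped because 'mean' is followed by 'F' rather than a dot, and the max column matches neither alternative.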
3. Replace the numeric activity codes from the raw file with descriptive names.
Each code from 1 to 6 is replaced with its activity label, using the code as an index into a vector of labels.
name_activities <- function(activity_data) {
  activities <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS", "SITTING", "STANDING", "LAYING")
  lapply(activity_data, function(x) activities[x])
}
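The lookup relies on plain vector indexing: the numeric code selects the label at that position. A small sketch with made-up codes follows; the factor variant is shown only as a possible alternative, not as what the script does.

```r
activities <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                "SITTING", "STANDING", "LAYING")

# Made-up activity codes, as they would appear in a y_*.txt file
codes <- c(5, 1, 6, 2)

# Indexing the label vector by code gives the readable names
labels <- activities[codes]
print(labels)

# Possible alternative: a factor keeps the codes and labels linked
as_factor <- factor(codes, levels = 1:6, labels = activities)
```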
4. A series of replacements to make column names more readable.
To make the column names more readable, a series of gsub substitutions is applied to each name.
col <- colnames(data)
colnames(data) <- sapply(col, function(x) {
  # Removing all dots from column names
  x <- gsub("\\.+", "", x, perl = TRUE)
  # Expanding initial letters 't' (time domain) and 'f' (frequency domain)
  x <- gsub("^t", "Time", x, perl = TRUE)
  x <- gsub("^f", "Frequency", x, perl = TRUE)
  # Expanding words
  x <- gsub("Acc", "Acceleration", x, perl = TRUE)
  # Camel case on the words 'mean' and 'std'
  x <- gsub("mean", "Mean", x, perl = TRUE)
  x <- gsub("std", "Std", x, perl = TRUE)
  # Return the cleaned name
  x
}, USE.NAMES = FALSE)
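As a quick check of what these substitutions produce, the sketch below applies them to three sample names. Since gsub is vectorized, it works on the whole character vector at once. The helper name tidy_names is hypothetical, and the 'f' prefix step is left out because all sample names are time-domain.

```r
# Hypothetical helper applying the same substitutions to a character vector
tidy_names <- function(x) {
  x <- gsub("\\.+", "", x)           # drop the dots left by read.table
  x <- gsub("^t", "Time", x)         # expand the leading 't'
  x <- gsub("Acc", "Acceleration", x)
  x <- gsub("mean", "Mean", x)
  x <- gsub("std", "Std", x)
  x
}

# Sample column names (illustrative)
sample_names <- c("Subject", "tBodyAcc.mean...X", "tGravityAcc.std...Z")
renamed <- tidy_names(sample_names)
print(renamed)
```

The Subject column passes through unchanged because none of the patterns match it.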
5. Summarising data
Using the dplyr library, we group the data by Subject id and Activity name. Every other column is then summarised by its mean within each group.
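library(dplyr)
# across() requires dplyr >= 1.0.0; it replaces the deprecated summarise_each(funs(mean))
averages <- data %>%
  group_by(Subject, Activity) %>%
  summarise(across(everything(), mean), .groups = "drop")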
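For readers who want to cross-check the grouping without dplyr, the same per-group mean can be sketched in base R with aggregate. The toy data frame and its values are made up for illustration, and this is an alternative to the dplyr call, not the method the script uses.

```r
# Toy data: two subjects, two activities, one measurement column (made-up values)
toy <- data.frame(
  Subject  = c(1, 1, 1, 2),
  Activity = c("WALKING", "WALKING", "SITTING", "WALKING"),
  TimeBodyAccelerationMeanX = c(0.2, 0.4, 0.1, 0.5)
)

# Mean of every remaining column for each Subject/Activity pair
toy_averages <- aggregate(. ~ Subject + Activity, data = toy, FUN = mean)
print(toy_averages)
```

Each Subject/Activity pair ends up as exactly one row, which is the shape the tidy output needs.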
6. Writing out the tidy data
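Finally, we write the summarised data out to a flat text file. write.table is used because write.csv ignores the col.names argument; row names are dropped and the header row is kept.
write.table(averages, file = "tidydata.txt", row.names = FALSE)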
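A short round-trip sketch shows how such a file can be read back for checking. It assumes the file is written with write.table and row.names = FALSE; the toy averages data frame and the temporary file are stand-ins for the real output.

```r
# Toy summary table standing in for the real result (made-up values)
averages <- data.frame(
  Subject  = c(1, 2),
  Activity = c("WALKING", "SITTING"),
  TimeBodyAccelerationMeanX = c(0.3, 0.5)
)

# Write to a temporary file so the sketch leaves nothing behind
out_file <- tempfile(fileext = ".txt")
write.table(averages, file = out_file, row.names = FALSE)

# header = TRUE restores the column names written on the first line
restored <- read.table(out_file, header = TRUE, stringsAsFactors = FALSE)
print(restored)
```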