The purpose of this project is to demonstrate the ability to collect, work with, and clean a data set using R.
This work is based on the "Human Activity Recognition Using Smartphones Dataset" Version 1.0 : http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
The project contains the following files:
- ./datasets/UCI HAR Dataset: Human Activity Recognition Using Smartphones Dataset, Version 1.0.
- README.md: description of the steps for performing the dataset analysis.
- CodeBook.md: describes all the variables and summaries calculated, along with units and any other relevant information.
- run_analysis.R: R script performing the analysis on the dataset.
To perform the analysis of the "Human Activity Recognition Using Smartphones Dataset", source the run_analysis.R script and call the function RunAnalysis(). The analysis performs the following steps:
- Merges the training and the test sets to create one data set.
- Extracts only the measurements on the mean and standard deviation for each measurement.
- Uses descriptive activity names to name the activities in the data set
- Appropriately labels the data set with descriptive variable names.
- From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.
The script uses the dplyr library:
library(dplyr)
All you need to do is source the run_analysis.R script and call RunAnalysis():
source("run_analysis.R")
RunAnalysis()
The script writes a tidy.txt file, which is the result of cleaning the initial Human Activity Recognition Using Smartphones Dataset. The output file contains 180 rows and 68 columns. The data in the file is described in CodeBook.md.
The run_analysis.R script performs the following analysis steps.
First, the paths to all files that have to be merged are defined. This is the very first part of the script.
# file path definitions
basePath <- "./datasets/UCI HAR Dataset"
trainsetPath <- file.path(basePath, "train/X_train.txt")
trainActivityPath <- file.path(basePath, "train/y_train.txt")
trainSubjectPath <- file.path(basePath, "train/subject_train.txt")
testsetPath <- file.path(basePath, "test/X_test.txt")
testActivityPath <- file.path(basePath, "test/y_test.txt")
testSubjectPath <- file.path(basePath, "test/subject_test.txt")
headersPath <- file.path(basePath, "features.txt")
activityLabelsPath <- file.path(basePath, "activity_labels.txt")
The read function is defined as follows:
# Reads a whitespace-delimited file (no header row) from a given @fromPath
ReadData <- function(fromPath) {
  df <- read.csv(fromPath, sep = "", header = FALSE, stringsAsFactors = FALSE)
  df
}
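As a minimal sketch of why sep = "" is used: the dataset files are whitespace-delimited rather than comma-separated, and an empty separator makes read.csv split on any run of whitespace. The temporary file and its contents below are made up for illustration and are not part of the dataset.

```r
# Same definition as in the script.
ReadData <- function(fromPath) {
  read.csv(fromPath, sep = "", header = FALSE, stringsAsFactors = FALSE)
}

# Hypothetical sample: two rows, three values, irregular spacing.
tmp <- tempfile(fileext = ".txt")
writeLines(c("  2.5e-01 -1.0e+00   3.0e+00",
             " -5.0e-01  2.0e+00  -4.0e+00"), tmp)

toyDf <- ReadData(tmp)
dim(toyDf)    # number of rows and columns parsed
names(toyDf)  # default names V1, V2, ... because header = FALSE
```

Because header = FALSE, the columns get the default names V1, V2, and so on, which is why later code refers to V1 and replaces the names explicitly.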
Reading the activity labels from the train and test sets and transforming them:
# read activity labels
lbls <- ReadData(activityLabelsPath)
trainLbls <- ReadData(trainActivityPath)
testLbls <- ReadData(testActivityPath)
# replacing the activity ids with their labels
trainLbls$V1 <- lbls[trainLbls$V1, 2]
testLbls$V1 <- lbls[testLbls$V1, 2]
# append train and test activity labels
lblsAll <- MergeDfs(trainLbls, testLbls, "activity")
lblsAll[, "activity"] <- as.factor(lblsAll[, "activity"])
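The id-to-label replacement above is a plain row-index lookup. A small sketch with made-up stand-ins for activity_labels.txt and an activity file (the toy values are assumptions, not the real data):

```r
# Hypothetical stand-in for activity_labels.txt: id in V1, label in V2.
lbls <- data.frame(V1 = 1:3,
                   V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS"),
                   stringsAsFactors = FALSE)

# Hypothetical stand-in for an activity file: one activity id per row.
ids <- data.frame(V1 = c(3L, 1L, 1L, 2L))

# The id selects the row of lbls, column 2 holds the label.
# This relies on the labels being stored in id order (row i has id i).
ids$V1 <- lbls[ids$V1, 2]
ids$V1
```

The lookup only works because the label table is sorted by id with no gaps. If that were not guaranteed, matching on the id column (for example with match()) would be the safer choice.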
Reading the subjects and transforming them:
# read subjects
trainSubjectIds <- ReadData(trainSubjectPath)
testSubjectIds <- ReadData(testSubjectPath)
# merge train and test subject into one data frame adding column name
subjectAll <- MergeDfs(trainSubjectIds, testSubjectIds, "subject")
Reading feature labels:
# read feature labels
headersDf <- ReadData(headersPath)
headers <- c(headersDf[, 2]) # we are only interested in names
Reading the train and test sets:
# reading train and test files
trainDf <- ReadData(trainsetPath)
testDf <- ReadData(testsetPath)
Step 1: Merges the training and the test sets to create one data set.
Step 2: Extracts only the measurements on the mean and standard deviation for each measurement.
Step 3: Uses descriptive activity names to name the activities in the data set
The first three steps of the task are carried out with the following pipeline of functions, chained with the dplyr operator %>%:
# merging train and test, adding feature labels as headers
dfAll <- MergeDfs(trainDf, testDf, headers) %>% # step 1
  ExtractMeanStd %>% # step 2: extract only mean() and std() columns
  cbind(subjectAll) %>% # adding subject column
  cbind(select(lblsAll, activity)) # step 3: adding activity column
In step 1, the train and test data sets are merged with the following function:
# Appends the train and test data frames together
# and sets the column names of the result.
MergeDfs <- function(train, test, headers) {
  dfJoined <- rbind(train, test)
  names(dfJoined) <- headers
  dfJoined
}
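A quick illustration of what MergeDfs does, using two tiny made-up data frames (the values and the column names "a" and "b" are hypothetical):

```r
# Same definition as in the script.
MergeDfs <- function(train, test, headers) {
  dfJoined <- rbind(train, test)
  names(dfJoined) <- headers
  dfJoined
}

# Hypothetical inputs with the default V1, V2 names that ReadData produces.
train <- data.frame(V1 = c(1, 2), V2 = c(10, 20))
test  <- data.frame(V1 = 3, V2 = 30)

merged <- MergeDfs(train, test, c("a", "b"))
merged  # train rows first, then test rows, with the new column names
```

Since the train rows always come first, the measurements, subjects, and activity labels stay aligned row by row as long as each of them is merged in the same train-then-test order, which is what the later cbind calls depend on.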
Step 2 extracts the mean() and std() columns:
# Extracts mean and standard deviation columns
ExtractMeanStd <- function(df) {
  df[, grep("(mean\\(\\)|std\\(\\))", names(df))]
}
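The escaped parentheses in the pattern matter: they restrict the match to names containing the literal text mean() or std(). A sketch on a one-row toy data frame (the values are made up; the column names follow the naming pattern of features.txt):

```r
# Same definition as in the script.
ExtractMeanStd <- function(df) {
  df[, grep("(mean\\(\\)|std\\(\\))", names(df))]
}

# Hypothetical one-row data frame with four feature-style column names.
toy <- data.frame(1, 2, 3, 4)
names(toy) <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)")

kept <- names(ExtractMeanStd(toy))
kept  # only the mean() and std() columns remain
```

Columns such as meanFreq() or the angle(...) features contain the word "mean" but not the literal "mean()", so this pattern leaves them out.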
Step 3 adds subject and activity columns:
cbind(subjectAll) %>% # adding subject column
cbind(select(lblsAll, activity)) # step 3: adding activity columns
Step 4: Appropriately labels the data set with descriptive variable names.
The columns are renamed so that their names follow tidy conventions (readable, with a clear meaning).
# step 4: renaming columns with tidy names
names(dfAll) <- RenameColumns(names(dfAll))
The column renaming function is defined as follows:
# Renames columns with descriptive names
RenameColumns <- function(name) {
  newName <- gsub("^t", "time", name)
  newName <- gsub("^f", "frequency", newName)
  # insert a dot before each capitalised word part (Acc, Body, Gyro, Jerk, Mag)
  newName <- gsub("([ABGJM])", "\\.\\1", newName)
  newName <- gsub("Acc", "Accelerometer", newName)
  newName <- gsub("Gyro", "Gyroscope", newName)
  newName <- gsub("Mag", "Magnitude", newName)
  newName <- gsub("\\-", "\\.", newName)
  newName <- gsub("\\(\\)", "", newName)
  tolower(newName)
}
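To make the effect of the renaming rules concrete, here is a step-by-step trace on a single time-domain feature name. This is only a sketch: the patterns are written in a slightly simplified but equivalent form, and the ^f rule is skipped because the example name starts with t.

```r
x <- "tBodyAccMag-mean()"

x <- gsub("^t", "time", x)             # "timeBodyAccMag-mean()"
x <- gsub("([ABGJM])", ".\\1", x)      # "time.Body.Acc.Mag-mean()"
x <- gsub("Acc", "Accelerometer", x)   # "time.Body.Accelerometer.Mag-mean()"
x <- gsub("Mag", "Magnitude", x)       # "time.Body.Accelerometer.Magnitude-mean()"
x <- gsub("-", ".", x, fixed = TRUE)   # "time.Body.Accelerometer.Magnitude.mean()"
x <- gsub("()", "", x, fixed = TRUE)   # "time.Body.Accelerometer.Magnitude.mean"
x <- tolower(x)

x  # "time.body.accelerometer.magnitude.mean"
```

The order of the rules matters: the dots are inserted before the abbreviations are expanded, so the expanded words themselves are never split.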
Step 5: From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.
# step 5: grouping with subject and activity and applying mean to all the numeric columns
tidyDf <- dfAll %>% group_by(subject, activity) %>% summarise_each(funs(mean))
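The grouping step produces one row per subject and activity pair, which is where the 180 rows of tidy.txt come from (30 subjects times 6 activities in the dataset). As a cross-check that does not depend on dplyr, the same kind of summary can be sketched in base R with aggregate(). The toy data frame below is made up for illustration.

```r
# Hypothetical toy data: 2 subjects, 2 activities, 2 measurement columns.
toy <- data.frame(
  subject  = c(1, 1, 1, 2, 2, 2),
  activity = c("WALKING", "WALKING", "LAYING", "LAYING", "LAYING", "WALKING"),
  m1 = c(1, 3, 5, 2, 4, 6),
  m2 = c(10, 30, 50, 20, 40, 60)
)

# Mean of every remaining column for each (subject, activity) pair.
toyTidy <- aggregate(. ~ subject + activity, data = toy, FUN = mean)
toyTidy
```

Note that summarise_each() and funs() are deprecated in recent dplyr releases in favour of across(), so the dplyr line above may emit deprecation warnings depending on the installed version.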
Finally, the tidy dataset is written to tidy.txt:
write.table(format(tidyDf, scientific = TRUE),
            file = "tidy.txt",
            row.names = FALSE)
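A file written this way can be read back with read.table(..., header = TRUE) to check its shape. The sketch below uses a made-up toy data frame and a temporary file rather than the real tidy.txt:

```r
# Hypothetical toy data frame standing in for the tidy data set.
toy <- data.frame(subject = c(1L, 2L),
                  activity = c("WALKING", "LAYING"),
                  value = c(0.123, -0.5))

# Write it the same way as the script does, but to a temporary file.
tmp <- tempfile(fileext = ".txt")
write.table(format(toy, scientific = TRUE), file = tmp, row.names = FALSE)

# Read it back; the first line holds the column names.
back <- read.table(tmp, header = TRUE)
dim(back)   # rows and columns, not counting the header line
str(back)   # numeric columns are parsed back from scientific notation
```

For the real output, the same call on "tidy.txt" should report the 180 rows and 68 columns mentioned earlier.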
The full script code is in run_analysis.R.