cleaningdata

######Readme Hello everybody. I will describe here how my script works. I know it is not perfect and I can optimize it, make it faster and smaller, but it is another story. At least it works and I ll describe how.

train<-read.table("UCI HAR Dataset/train/X_train.txt")
test<-read.table("UCI HAR Dataset/test/X_test.txt")

We are reading raw data from the folder UCI HAR Dataset in working directory

install.packages("plyr")
install.packages("dplyr")
install.packages("Hmisc")
library(plyr)
library(dplyr)
library(Hmisc)

This one is to be sure that all packages are installed

test_df<-tbl_df(test)
train_df<-tbl_df(train)

We will convert the data into the local data frame to use plyr and dplyr package

mg<-merge(train_df, test_df, by = intersect(names(train_df), names(test_df)),all=TRUE)

The mg is the merged dataset (10299x561)

features<-read.table("UCI HAR Dataset/features.txt")

Taking the name of variables (561)

coln<-features[,2]
coln<-as.vector(coln)
colnames(mg)<-coln

Here we are using this name from features to our merged dataset

valid_column_names <- make.names(names=names(mg), unique=TRUE, allow_ = TRUE)
- names(mg) <- valid_column_names

I found out that R does count some column names from features as duplicate. This is a solution to give them a Unique name and solve this problem So, now mg is a data set 10299x561 with unique column names

mgs<-select(mg,matches("(mean|std)"))

And we need columns with only mean and std in it (10299x86)

ytest<-read.table("UCI HAR Dataset/test/y_test.txt")
ytrain<-read.table("UCI HAR Dataset/train/y_train.txt")

We need to find out which observation for which activity. This information can be found in this two files

mgac<-bind_rows(ytrain, ytest)
mgac<-as.data.frame(mgac)
rown<-mgac[,1]
rown<-as.vector(rown)

We need to merge name of observation in one set and take a vector of it. So we have a vector of 10299 which consist only of digitals from 1 to 6

row.f<-factor(rown,labels =c('WALKING', 'WALKING_UPSTAIRS' , 'WALKING_DOWNSTAIRS', 'SITTING', 'STANDING' , 'LAYING'))

Making a factor of it with name of each activity

*mgs<-cbind(mgs,row.f)

Append this vector for data set with selected variables of mean and std

walking<-filter(mgs,row.f == "WALKING")
walkingup<-filter(mgs,row.f == "WALKING_UPSTAIRS")
walkingdo<-filter(mgs,row.f == "WALKING_DOWNSTAIRS")
sitting<-filter(mgs,row.f == "SITTING")
standing<-filter(mgs,row.f == "STANDING")
laying<-filter(mgs,row.f == "LAYING")

We are making 6 data sets, each set has observation for only one Activity

walking <-summarise_each(walking,funs(mean), -row.f)
walkingup <-summarise_each(walkingup,funs(mean), -row.f)
walkingdo <-summarise_each(walkingdo,funs(mean), -row.f)
sitting <-summarise_each(sitting,funs(mean), -row.f)
standing <-summarise_each(standing,funs(mean), -row.f)
laying <-summarise_each(laying,funs(mean), -row.f)

Taking average of each variable for for only one Activity and excluding column name which cannot be averaged(name of activity)

walking<-cbind(walking,Activity="Walking")
walkingup<-cbind(walkingup,Activity="Walking_upstairs")
walkingdo<-cbind(walkingdo,Activity="Walking_downstairs")
sitting <-cbind(sitting,Activity="Sitting")
standing <-cbind(standing,Activity="Standing")
laying <-cbind(laying,Activity="Laying")

Now we have 6 data sets with 1 observation and 86 variables, variables are averaged from previous one. We need to add a column with the name of activity

ac2<-merge(walking, walkingup, by = intersect(names(walking), names(walkingup)),all=TRUE)
ac3<-merge(ac2, walkingdo, by = intersect(names(ac2), names(walkingdo)),all=TRUE)
ac4<-merge(ac3, sitting, by = intersect(names(ac3), names(sitting)),all=TRUE)
ac5<-merge(ac4, standing, by = intersect(names(ac4), names(standing)),all=TRUE)
final<-merge(ac5, laying, by = intersect(names(ac5), names(laying)),all=TRUE)

We need to merge this 6 data sets into the one. It has 6 observation and 87 variables.

col_idx <- grep("Activity", names(final))
final <- final[, c(col_idx, (1:ncol(final))[-col_idx])]

We need to order columns, so the Name of activity is the first. Of course we could use row names, but it is written than we have to use row.name=FALSE when we ll write the file

write.table(final,"final_solution.txt",row.name=FALSE,col.names = FALSE )

File is recorded

VENHEADs/cleaningdata

cleaningdata