/cleaningdata

Example of cleaning data with a codebook

Primary LanguageR

cleaningdata

######Readme Hello everybody. I will describe here how my script works. I know it is not perfect and I can optimize it, make it faster and smaller, but it is another story. At least it works and I ll describe how.

  • train<-read.table("UCI HAR Dataset/train/X_train.txt")
  • test<-read.table("UCI HAR Dataset/test/X_test.txt")

We are reading raw data from the folder UCI HAR Dataset in working directory

  • install.packages("plyr")
  • install.packages("dplyr")
  • install.packages("Hmisc")
  • library(plyr)
  • library(dplyr)
  • library(Hmisc)

This one is to be sure that all packages are installed

  • test_df<-tbl_df(test)
  • train_df<-tbl_df(train)

We will convert the data into the local data frame to use plyr and dplyr package

  • mg<-merge(train_df, test_df, by = intersect(names(train_df), names(test_df)),all=TRUE)

The mg is the merged dataset (10299x561)

  • features<-read.table("UCI HAR Dataset/features.txt")

Taking the name of variables (561)

  • coln<-features[,2]
  • coln<-as.vector(coln)
  • colnames(mg)<-coln

Here we are using this name from features to our merged dataset

  • valid_column_names <- make.names(names=names(mg), unique=TRUE, allow_ = TRUE)
    • names(mg) <- valid_column_names

I found out that R does count some column names from features as duplicate. This is a solution to give them a Unique name and solve this problem So, now mg is a data set 10299x561 with unique column names

  • mgs<-select(mg,matches("(mean|std)"))

And we need columns with only mean and std in it (10299x86)

  • ytest<-read.table("UCI HAR Dataset/test/y_test.txt")
  • ytrain<-read.table("UCI HAR Dataset/train/y_train.txt")

We need to find out which observation for which activity. This information can be found in this two files

  • mgac<-bind_rows(ytrain, ytest)
  • mgac<-as.data.frame(mgac)
  • rown<-mgac[,1]
  • rown<-as.vector(rown)

We need to merge name of observation in one set and take a vector of it. So we have a vector of 10299 which consist only of digitals from 1 to 6

  • row.f<-factor(rown,labels =c('WALKING', 'WALKING_UPSTAIRS' , 'WALKING_DOWNSTAIRS', 'SITTING', 'STANDING' , 'LAYING'))

Making a factor of it with name of each activity

*mgs<-cbind(mgs,row.f)

Append this vector for data set with selected variables of mean and std

  • walking<-filter(mgs,row.f == "WALKING")
  • walkingup<-filter(mgs,row.f == "WALKING_UPSTAIRS")
  • walkingdo<-filter(mgs,row.f == "WALKING_DOWNSTAIRS")
  • sitting<-filter(mgs,row.f == "SITTING")
  • standing<-filter(mgs,row.f == "STANDING")
  • laying<-filter(mgs,row.f == "LAYING")

We are making 6 data sets, each set has observation for only one Activity

  • walking <-summarise_each(walking,funs(mean), -row.f)
  • walkingup <-summarise_each(walkingup,funs(mean), -row.f)
  • walkingdo <-summarise_each(walkingdo,funs(mean), -row.f)
  • sitting <-summarise_each(sitting,funs(mean), -row.f)
  • standing <-summarise_each(standing,funs(mean), -row.f)
  • laying <-summarise_each(laying,funs(mean), -row.f)

Taking average of each variable for for only one Activity and excluding column name which cannot be averaged(name of activity)

  • walking<-cbind(walking,Activity="Walking")
  • walkingup<-cbind(walkingup,Activity="Walking_upstairs")
  • walkingdo<-cbind(walkingdo,Activity="Walking_downstairs")
  • sitting <-cbind(sitting,Activity="Sitting")
  • standing <-cbind(standing,Activity="Standing")
  • laying <-cbind(laying,Activity="Laying")

Now we have 6 data sets with 1 observation and 86 variables, variables are averaged from previous one. We need to add a column with the name of activity

  • ac2<-merge(walking, walkingup, by = intersect(names(walking), names(walkingup)),all=TRUE)
  • ac3<-merge(ac2, walkingdo, by = intersect(names(ac2), names(walkingdo)),all=TRUE)
  • ac4<-merge(ac3, sitting, by = intersect(names(ac3), names(sitting)),all=TRUE)
  • ac5<-merge(ac4, standing, by = intersect(names(ac4), names(standing)),all=TRUE)
  • final<-merge(ac5, laying, by = intersect(names(ac5), names(laying)),all=TRUE)

We need to merge this 6 data sets into the one. It has 6 observation and 87 variables.

  • col_idx <- grep("Activity", names(final))

  • final <- final[, c(col_idx, (1:ncol(final))[-col_idx])]

We need to order columns, so the Name of activity is the first. Of course we could use row names, but it is written than we have to use row.name=FALSE when we ll write the file

  • write.table(final,"final_solution.txt",row.name=FALSE,col.names = FALSE )

File is recorded