This repository contains the data, R code and project report for Practical Machine Learning course on Coursera.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here. The test data are available here. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
The goal of this project is to predict the manner in which they did the exercise. This is the "classe" variable in the training set. We may use any of the other variables to predict with.
In this project, I am going to use 'caret' package for building a predictive model. First, this package has been loaded into R and then the training and test data were read into R.
library(caret)
data<-read.csv('pml-training.csv', header=T)
data_test<-read.csv('pml-testing.csv', header=T)
Feature selection largely influences the training time and the model performance. Therefore, it is crucial to retain the important features and remove the uninteresting features. Examples of such uninteresting features include those that contain lot of missing values, those that exhibit high correlation with other features (redundancy), and those with variance near zero.
Here, I excluded all those variables with more than 70% of missing values (NAs)
nast<-sapply(1:ncol(data),function(i) if(sum(is.na(data[,i]))>0.7*nrow(data)){return(TRUE)}else{return(FALSE)})
data<-data[,!nast]
All those variables with variance near zero were excluded.
nzv <- nearZeroVar(data, saveMetrics = T)
data<-data[,!nzv$nzv]
All those fatures that exhibit high correlation (correlation coefficient > 0.70) with other features (not with the outcome variable classe) were also excluded. Also some features like the username and time-stamps that are not requred for prediction were also eliminated.
cor_train<-cor(data.matrix(data[,1:ncol(data)-1]))
highCorr<-findCorrelation(cor_train,0.70)
data<-data[,-highCorr]
data<-data[,5:ncol(data)]
The provided training data has been further partitioned into training and test datasets in 60:40 ratio.
inTrain<-createDataPartition(y=data$classe, p=0.6, list=FALSE)
training<-data[inTrain,]
testing<-data[-inTrain,]
For cross validation, 3 repeats of 10-fold cross validation method was used while training the model.
train_control<-trainControl(method='repeatedcv', number=10, repeats=3, verboseIter = TRUE)
Due to the popularity of the higher accuracy of ensemble methods, I chose to use a widely used ensemble method, the random forests, to train the model.
model<-train(classe~.,data=training, trControl=train_control, method='rf')
pred_train<-predict(model,newdata=training)
confusionMatrix(pred,training$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 3348 0 0 0 0
B 0 2279 0 0 0
C 0 0 2054 0 0
D 0 0 0 1930 0
E 0 0 0 0 2165
Overall Statistics
Accuracy : 1
95% CI : (0.9997, 1)
No Information Rate : 0.2843
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
The trained model exhibited an accuracy of 99.67% and the estimated out of sample error rate is 0.33%.
pred_test<-predict(model,newdata=testing)
confusionMatrix(pred1,testing$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 2229 2 0 0 0
B 0 1516 9 3 0
C 0 0 1355 5 0
D 1 0 4 1278 0
E 2 0 0 0 1442
Overall Statistics
Accuracy : 0.9967
95% CI : (0.9951, 0.9978)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9958
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9987 0.9987 0.9905 0.9938 1.0000
Specificity 0.9996 0.9981 0.9992 0.9992 0.9997
Pos Pred Value 0.9991 0.9921 0.9963 0.9961 0.9986
Neg Pred Value 0.9995 0.9997 0.9980 0.9988 1.0000
Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2841 0.1932 0.1727 0.1629 0.1838
Detection Prevalence 0.2843 0.1947 0.1733 0.1635 0.1840
Balanced Accuracy 0.9991 0.9984 0.9949 0.9965 0.9998
The trained model has been used to predict the classe variable of the unknown test data with 100% accuracy. The answers have been submitted to the course webpage.
pred_unknown<-predict(model,newdata=data_test)
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
pml_write_files(pred_unknown)