Weight Lifting Exercise Analysis
Christoph Fabianek
Sunday, August 23rd, 2015


This project investigates data collected during weight lifting exercises and applies a machine learning algorithm from the caret package for the R programming language to predict the manner in which the exercises were performed. This report was written for the course Practical Machine Learning of the Coursera Data Science Specialization.


Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available at http://groupware.les.inf.puc-rio.br/har.

Data Processing

First, the underlying training and test data are downloaded from the web and read. For this analysis, the data are cleaned as follows:

  • remove the first 7 columns (X, user_name, time_stamps, *_window) since they are not relevant for classification
  • remove columns with over 60% NAs
  • remove near zero variance predictors
  • convert classe into a factor variable
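
# knitr options
options(scipen = 10, digits = 2)

# load libraries
library(caret)       # nearZeroVar(), createDataPartition(), train(), confusionMatrix()
library(parallel)    # makeCluster(), detectCores(), stopCluster()
library(doParallel)  # registerDoParallel()
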
# load & read data
if (!file.exists("data")) {
        dir.create("data")
}
if (!file.exists("./data/pml-training.csv")) {
        fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
        download.file(fileUrl, destfile = "./data/pml-training.csv")
        dateDownloaded_training <- date()
}
training <- read.csv("./data/pml-training.csv", header = TRUE)

if (!file.exists("./data/pml-testing.csv")) {
        fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
        download.file(fileUrl, destfile = "./data/pml-testing.csv")
        dateDownloaded_test20 <- date()
}
test20 <- read.csv("./data/pml-testing.csv", header = TRUE)

## Data Cleaning
# remove first 7 columns
training <- training[, 8:ncol(training)]

# remove columns with >60% NAs
NAs <- apply(training, 2, function(x) {sum(is.na(x))})
training <- training[, which(NAs < nrow(training)*0.6)]

# remove near zero variance predictors
NZVs <- nearZeroVar(training, saveMetrics = TRUE)
training <- training[, NZVs$nzv == FALSE]

# convert classe into factor
training$classe <- factor(training$classe)
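
The 60% missing-value rule can be illustrated on a small example. The following is a minimal sketch in base R, assuming a made-up data frame `toy`; the column names and values are hypothetical and are not taken from the weight lifting data.

```r
# Toy data frame (hypothetical values) with different shares of missing data
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),        # 0% NA  -> kept
  b = c(NA, NA, NA, NA, 5),    # 80% NA -> dropped
  c = c(1, NA, 3, NA, 5)       # 40% NA -> kept
)

# Fraction of missing values per column
na_fraction <- colMeans(is.na(toy))

# Keep only columns whose NA fraction is below the 60% threshold
toy_clean <- toy[, na_fraction < 0.6, drop = FALSE]
names(toy_clean)
```

Comparing the NA fraction with 0.6 is equivalent to comparing the NA count with `nrow(training) * 0.6`, as done in the report's code.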

Afterwards, the cleaned dataset is split into a training set (60%) and a testing set (40%).

trainset <- createDataPartition(training$classe, p = 0.6, list = FALSE)
data_training <- training[trainset, ]
data_testing <- training[-trainset, ]
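
For a factor outcome, createDataPartition samples within each class, so both parts keep roughly the same class proportions. The idea can be sketched in base R as follows; this is an illustration of stratified sampling on made-up labels, not caret's actual implementation, and the seed value is an arbitrary assumption used only to make the sketch repeatable.

```r
set.seed(1)  # hypothetical seed, only for a repeatable illustration

# Toy outcome with unbalanced classes (hypothetical data)
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Row indices grouped by class
idx_by_class <- split(seq_along(y), y)

# Draw 60% of the indices within each class
sampled <- lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
})
train_idx <- sort(unlist(sampled, use.names = FALSE))

table(y[train_idx])   # class counts in the 60% part
table(y[-train_idx])  # class counts in the 40% part
```

Calling set.seed() before createDataPartition in the report's code would likewise make the 60/40 split repeatable across runs.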

Model Fitting

Based on various tests, Random Forest with 10-fold cross-validation is chosen as the algorithm to obtain a small out-of-sample error. (For performance reasons, a parallel cluster is set up.)

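cluster <- makeCluster(detectCores() - 1)
doParallel::registerDoParallel(cluster)  # register the cluster so train() can use it
ctrl <- trainControl(method = "cv",
                     number = 10,
                     allowParallel = TRUE)
model <- train(classe ~ ., data = data_training, method = "rf",
               trControl = ctrl, prox = FALSE)
stopCluster(cluster)                     # release the worker processes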


cm <- confusionMatrix(predict(model, data_testing), data_testing$classe)

To get an unbiased estimate of the model performance, the model (Random Forest with 10-fold cross-validation) is applied to the so far untouched testing dataset:

  • The confusionMatrix reports an accuracy of `r cm$overall["Accuracy"]*100`%.
  • The expected out-of-sample error is `r (sum(predict(model, data_testing) != data_testing$classe)/length(data_testing$classe))*100`%.
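
The two figures above are complementary: the out-of-sample error is the share of misclassified held-out observations, i.e. one minus the accuracy. A minimal base R sketch, assuming made-up label vectors `truth` and `pred` (hypothetical, not results from the model):

```r
# Hypothetical true and predicted labels for ten held-out observations
truth <- factor(c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E"))
pred  <- factor(c("A", "A", "B", "C", "C", "C", "D", "D", "E", "A"),
                levels = levels(truth))

accuracy  <- mean(pred == truth)  # share of correct predictions
oos_error <- mean(pred != truth)  # share of misclassified observations

# Base R cross-tabulation of predictions against the reference labels
table(Prediction = pred, Reference = truth)
```

Because every observation is either correct or misclassified, `accuracy + oos_error` equals 1, so reporting one determines the other.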

Finally, the following figure shows the importance of the variables:

plot(varImp(model), main = "Importance of Top 20 Variables", xlab="Importance in %", top = 20)


The Random Forest algorithm with cross-validation gives high accuracy and a low error rate out of the box, without much tweaking. It was interesting to experiment with various parameters of the algorithms used in order to improve performance on the local machine. Nevertheless, the best overall result was achieved with the default settings.


Prediction Assignment Submission

The generated model is applied to the original test data stored in test20, and the predictions are written to problem_id_X.txt according to the instructions.
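
The submission step could be sketched as follows, assuming that X in problem_id_X.txt is the index of the test case and that each file holds a single predicted label. The helper name `write_prediction_files` and its `out_dir` argument are hypothetical and are not taken from the report.

```r
# Write each prediction to its own file problem_id_<i>.txt.
# `write_prediction_files` and `out_dir` are hypothetical names for this sketch.
write_prediction_files <- function(predictions, out_dir = ".") {
  for (i in seq_along(predictions)) {
    file_name <- file.path(out_dir, paste0("problem_id_", i, ".txt"))
    write.table(as.character(predictions[i]), file = file_name,
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
  invisible(length(predictions))
}

# With the fitted model from this report, the call would look like:
# answers <- predict(model, newdata = test20)
# write_prediction_files(answers)

# Stand-alone demonstration with made-up labels, written to a temporary folder
demo_dir <- tempdir()
write_prediction_files(c("A", "B", "C"), out_dir = demo_dir)
readLines(file.path(demo_dir, "problem_id_2.txt"))
```

Converting each prediction with as.character() ensures the class label itself is written rather than the factor's internal integer code.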

