
Built a model that accurately predicts which type of workout (ride, run, hike, etc.) I completed, based on my personal Strava data.

Workouts Classifier: Project Overview

  • Built a model that accurately predicts what type of workout (ride, run, hike, etc.) I have completed.
  • Extracted 1280 workouts and their accompanying data using the Strava API.
  • Cleaned messy data and interpolated missing values.
  • Optimized Logistic Regression, Random Forest, and XGBoost classifiers to reach the best model.

Code and Resources Used

Data Collection

Using the stravalib library and the Strava API, I extracted 1280 of my Strava activities. For each activity, I collected the following fields:

  • type
  • date
  • moving_time
  • activity_id
  • name
  • distance
  • elevation gain
  • trainer
  • average_speed
  • max_speed
  • average_watts
  • suffer_score
  • average_heartrate
  • average_cadence
  • kilojoules
  • gear_id
  • average_temp
  • start_longitude
  • start_latitude
  • timezone
  • location_city
  • location_state
  • location_country
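
The extraction step could be sketched roughly as follows. This is a minimal sketch under assumptions, not the project's actual code: `ACTIVITY_FIELDS`, `activity_to_row`, and `fetch_rows` are hypothetical names, the access token is a placeholder you would supply yourself, and the attribute names on stravalib activity objects are assumed to match the field list above (they may differ between stravalib versions).

```python
from typing import Any, Dict, Iterable, List

# A subset of the fields listed above. The attribute names are an assumption
# about what the activity objects expose.
ACTIVITY_FIELDS = [
    "type", "moving_time", "name", "distance", "trainer",
    "average_speed", "max_speed", "average_watts", "suffer_score",
    "average_heartrate", "average_cadence", "kilojoules",
]


def activity_to_row(activity: Any, fields: Iterable[str] = ACTIVITY_FIELDS) -> Dict[str, Any]:
    """Copy the requested attributes into a dict, using None when one is missing."""
    return {field: getattr(activity, field, None) for field in fields}


def fetch_rows(access_token: str, limit: int = 1280) -> List[Dict[str, Any]]:
    """Pull activities with stravalib and flatten them (needs network access)."""
    from stravalib.client import Client  # imported lazily; third-party dependency

    client = Client(access_token=access_token)
    return [activity_to_row(a) for a in client.get_activities(limit=limit)]
```

Falling back to `None` for absent attributes keeps every row the same shape, which makes the missing values easy to find during cleaning.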

Data Cleaning

After extracting the data, I needed to clean it up so that it was usable for the model. I made the following changes and created the following variables:

  • Checked all columns for missing values
  • Filled in any missing values for average_heartrate, kilojoules, suffer_score, and average_temp with the mean of each column.
  • Changed any blank values for average_cadence and average_watts to 0.
  • Engineered new features from dates (year, month, day of the week, etc.) and locations.
  • Removed workout types I wasn't interested in classifying
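
The cleaning steps above might look something like this in pandas. It is a sketch under assumptions, not the original notebook code: `clean_activities`, `MEAN_FILL`, `ZERO_FILL`, and `keep_types` are hypothetical names, and only the date-derived features are shown because the write-up does not specify the location features.

```python
import pandas as pd

MEAN_FILL = ["average_heartrate", "kilojoules", "suffer_score", "average_temp"]
ZERO_FILL = ["average_cadence", "average_watts"]


def clean_activities(df: pd.DataFrame, keep_types: list) -> pd.DataFrame:
    """Apply the cleaning steps listed above to a copy of the raw frame."""
    df = df.copy()

    # Missing sensor readings are replaced with the column mean.
    for col in MEAN_FILL:
        df[col] = df[col].fillna(df[col].mean())

    # Missing cadence or power data is treated as zero.
    df[ZERO_FILL] = df[ZERO_FILL].fillna(0)

    # Date-derived features.
    dates = pd.to_datetime(df["date"])
    df["year"] = dates.dt.year
    df["month"] = dates.dt.month
    df["day_of_week"] = dates.dt.dayofweek  # Monday = 0

    # Keep only the workout types of interest.
    return df[df["type"].isin(keep_types)].reset_index(drop=True)
```

Note that the means here are computed before the unwanted workout types are dropped, following the order of the list; computing them afterwards would be an equally reasonable choice.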


I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights from the pivot tables.

[Figures: four images of pivot-table highlights]

Model Building

First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.
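
As a rough illustration, the encoding and split could be done with `pandas.get_dummies` and scikit-learn's `train_test_split`. The helper name `make_splits`, the random seed, and the stratified split are assumptions that the write-up does not state.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def make_splits(df: pd.DataFrame, target: str = "type", seed: int = 42):
    """One-hot encode categorical features and hold out 20% for testing."""
    y = df[target]
    # get_dummies expands object/category columns and leaves numeric ones alone.
    X = pd.get_dummies(df.drop(columns=[target]))
    # Stratifying on the label is an assumption, not stated in the write-up.
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)
```

The function returns `X_train, X_test, y_train, y_test` in that order. Stratifying keeps rare workout types represented in both sets, but it requires at least two examples of every class.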

I tried three different models and evaluated them using accuracy as my primary metric, while also looking at recall and F1-score:

  • Logistic Regression – Baseline for the model
  • Random Forest – Because of the sparse data from the many categorical variables, I thought a Random Forest would be effective.
  • XGBoost – Because gradient-boosted trees such as XGBoost often outperform other algorithms on tabular data, I thought it would be a good fit.
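
A comparison along these lines could be sketched as below. This is not the project's tuned setup: the hyperparameters are library defaults or arbitrary picks, `candidate_models` and `evaluate` are hypothetical helpers, the data is synthetic stand-in data from `make_classification`, and scaling the inputs for logistic regression is an assumption. XGBoost is included only if the `xgboost` package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def candidate_models(seed: int = 0) -> dict:
    """Return the classifiers to compare; XGBoost is added only if installed."""
    models = {
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=seed),
    }
    try:
        from xgboost import XGBClassifier  # optional third-party dependency

        models["XGBoost"] = XGBClassifier(random_state=seed)
    except ImportError:
        pass
    return models


def evaluate(models, X_train, X_test, y_train, y_test) -> dict:
    """Fit each model and report accuracy, macro recall, and macro F1."""
    scores = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        scores[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "recall": recall_score(y_test, pred, average="macro"),
            "f1": f1_score(y_test, pred, average="macro"),
        }
    return scores


# Synthetic stand-in data with integer labels (XGBoost expects 0..n_classes-1).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=3, random_state=0
)
splits = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
results = evaluate(candidate_models(), *splits)
```

With real workout labels such as "Ride" or "Run", the target would need to be integer-encoded (for example with scikit-learn's `LabelEncoder`) before being passed to XGBoost.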

Model performance

The XGBoost model far outperformed the other approaches on the test and validation sets.

  • Logistic Regression: Accuracy = 92.97%
  • Random Forest: Accuracy = 98.44%
  • XGBoost: Accuracy = 100%
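
Because accuracy alone can hide differences between workout types, a per-class breakdown is a useful companion to the numbers above. The sketch below uses only the standard library; the labels are made up for illustration and are not results from this project, and `confusion_counts` and `per_class_recall` are hypothetical helper names.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))


def per_class_recall(y_true, y_pred):
    """Fraction of each actual class that was predicted correctly."""
    pairs = confusion_counts(y_true, y_pred)
    totals = Counter(y_true)
    return {label: pairs[(label, label)] / n for label, n in totals.items()}


# Made-up labels for illustration only, not results from this project.
actual = ["Ride", "Ride", "Ride", "Run", "Run", "Hike"]
predicted = ["Ride", "Ride", "Ride", "Run", "Hike", "Hike"]
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
recall = per_class_recall(actual, predicted)
```

In this toy example the overall accuracy is 5/6, yet only half of the runs are recovered, which is the kind of gap that recall and F1-score are meant to expose.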