Workouts Classifier: Project Overview
- Built a model that accurately predicts what type of workout (ride, run, hike, etc.) I have completed.
- Extracted 1280 workouts and their accompanying data using the Strava API.
- Cleaned messy data and Interpolated missing data
- Optimized Logistic Regression, Random Forrest, and XGBoost Classifers to reach the best model.
Code and Resources Used
- Python Version: 3.8.5
- Packages: pandas, numpy, stravalib, sklearn, matplotlib, seaborn, altair
- Inspiration: https://github.com/PlayingNumbers/ds_salary_proj
Data Collection
Using the stravalib library and the Strava API I extracted 1280 of my strava activities. With each activity, we got the following:
- type
- date
- moving_time
- activity_id
- name
- distance
- elevation gain
- trainer
- average_speed
- max_speed
- average_watts
- suffer_score
- average_heartrate
- average_cadence
- kilojoules
- gear_id
- average_temp
- start_longitude
- start_latitude
- timezone
- location_city
- location_state
- location_country
Data Cleaning
After extracting the data, I needed to clean it up so that it was usable for our model. I made the following changes and created the following variables:
- Checked all columns for missing values
- Filled in any missing values for average_heartrate, kilojoules, suffer_score, and average temp with the average of those columns
- Changed any blank values for average_cadence and average_watts to 0.
- Feature engineered various new features sorrounding dates (year, month, day of the week etc.) and locations.
- Removed workout types I wasn't interested in classifying
EDA
I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights from the pivot tables.
Model Building
First, I transformed the categorical variables into dummy variables. I also split the data into train and tests sets with a test size of 20%.
I tried three different models and evaluated them using accuracy as my primary metric but also looking into recall and f1-score
I tried three different models:
- Logistic Regression – Baseline for the model
- Random Forest – Because of the sparse data from the many categorical variables, I thought a Random Forest would be effective.
- XGBoost – Given XGboost typically out performs other algorithms, I thought that this would be a good fit.
Model performance
The XGBoost model far outperformed the other approaches on the test and validation sets.
- Logistic Regression : Accuracy = 92.97%
- Random Forest: Accuracy = 98.44%
- XGBoost: Accuracy = 100%