ProCyclingStats Points Prediction

Objective

The goal of this project was to determine how accurately I could predict ProCyclingStats points with linear regression using features scraped from the race startlist, rider resume, and individual rankings pages of "the single most useful cycling database on the worldwide web" (Rouleur, 2017).

Tools

  • Requests for pulling content
  • BeautifulSoup for saving and parsing HTML
  • Numpy and Pandas for data manipulation
  • Statsmodels and Scikit-learn for modeling
  • Matplotlib and Seaborn for plotting

Scraping & Parsing

I used the following races' startlists to build a list of riders, then scraped each rider's profile page for PCS points by season along with a few other characteristics for my model.

Startlists - 85 total

  • Since 2007: Giro d'Italia, Tour de France
  • Since 2010: Vuelta a España
  • Since 2013: Amstel Gold Race, Gent–Wevelgem, Il Lombardia, Liège–Bastogne–Liège, Milan–San Remo, Paris–Nice, Paris–Roubaix, Ronde van Vlaanderen, Strade Bianche, Tour of California, Tour de Suisse

The output was 1,730 unique riders, whose profiles comprised 17,806 pages, about 10 seasons of data per rider. I also scraped the PCS individual rankings for each year since 2005, totaling 292 pages; however, this ranking data has not yet been incorporated into my modeling.
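The startlist-parsing step can be sketched with BeautifulSoup. The HTML fragment and CSS selector below are hypothetical stand-ins; the real PCS markup and URLs differ:

```python
from bs4 import BeautifulSoup

# Hypothetical startlist HTML fragment; real PCS pages are structured
# differently, so this markup and the selector below are illustrative only.
SAMPLE_STARTLIST = """
<ul class="startlist">
  <li><a href="rider/tadej-pogacar">Tadej Pogacar</a></li>
  <li><a href="rider/chris-froome">Chris Froome</a></li>
</ul>
"""

def parse_rider_links(html):
    """Return (name, relative URL) pairs for each rider on a startlist page."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.select("ul.startlist a[href^='rider/']")]

riders = parse_rider_links(SAMPLE_STARTLIST)
```

Collecting the rider URLs from every startlist and deduplicating them would yield the unique-rider list whose profile pages are then fetched with Requests.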

Features per Rider - those in italics were parsed but not loaded into the model

  • rider name
  • team name
  • nationality
  • date of birth (converted to age by year)
  • height
  • weight
  • lifetime points by category: one day, general classification, time trial, sprint
  • total race seasons
  • dates raced by year
  • race names by year
  • distance raced by year
  • number of race days by year
  • number of stage races by year
  • UCI points won by year
  • PCS points won by year

Modeling & Evaluation

The target variable (S0) for each rider was the PCS points from their most recently completed full racing season (2018, still in progress, was excluded). Seasons were indexed relative to S0, with S1, S2, etc. referring to prior seasons, which allowed retired riders to be included alongside currently active ones even though their careers span different calendar years. Age was the only other variable besides points used on a per-season basis with the same Sn nomenclature.
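The relative-season alignment can be sketched with pandas. The toy data and column names here are mine, not the project's actual schema:

```python
import pandas as pd

# Toy per-season data: one row per rider per calendar year.
# Rider A retired in 2017, rider B in 2011; both still get an S0.
df = pd.DataFrame({
    "rider": ["A", "A", "A", "B", "B"],
    "year":  [2015, 2016, 2017, 2010, 2011],
    "pcs_points": [100, 150, 200, 80, 60],
})

def to_relative_seasons(df):
    """Re-index each rider's seasons so the most recent one is S0,
    the one before it S1, etc., regardless of calendar year."""
    df = df.sort_values(["rider", "year"], ascending=[True, False]).copy()
    df["season"] = "S" + df.groupby("rider").cumcount().astype(str)
    return df.pivot(index="rider", columns="season", values="pcs_points")

wide = to_relative_seasons(df)
```

Riders with shorter careers simply have NaN in the older Sn columns, which is what lets careers without any calendar overlap sit in the same design matrix.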

The data was split into 80% training and 20% testing (holdout) sets. All model evaluation and selection were performed with 5-fold cross-validation on the training set; a single final test was run on the 20% out-of-sample data.

Baseline

  • Ordinary least squares linear regression with a single feature, S1 points
  • 0.642 R2
  • 0.578 avg. CV R2
  • 183 avg. CV RMSE
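The evaluation protocol for this baseline (80/20 split, 5-fold CV on the training set, single-feature OLS) can be sketched with scikit-learn. The data here is a synthetic stand-in, not the scraped dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in: last season's points (S1) predicting this season's (S0).
rng = np.random.default_rng(0)
s1 = rng.exponential(scale=150, size=500)        # right-skewed, like PCS points
s0 = 0.8 * s1 + rng.normal(0, 50, size=500)      # next season correlates with last

X, y = s1.reshape(-1, 1), s0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()
cv_rmse = -cross_val_score(model, X_train, y_train, cv=5,
                           scoring="neg_root_mean_squared_error").mean()
```

The holdout pair (X_test, y_test) is untouched until the one final out-of-sample test.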

OLS with 7 features

  • Addition of selected features above
  • 0.708 R2
  • 0.654 avg. CV R2
  • 166 avg. CV RMSE

OLS with 74 features

  • All non-italics features above with nationality as dummy variables
  • 0.760 R2

Lasso with scaled features

  • Regularization with optimal alpha selected 21 of 74 features
  • 0.749 R2
  • 0.715 avg. CV R2
  • 151 avg. CV RMSE

OLS with scaled features

  • OLS refit on the 21 Lasso-selected features, discarding Lasso's shrunken coefficients
  • 0.756 R2
  • 0.710 avg. CV R2
  • 149 avg. CV RMSE

Out of sample OLS test

  • 21 features
  • 0.687 R2
  • 137 RMSE on the holdout set

Observations & Next Steps

The target distribution is highly right-skewed, and predictive accuracy may benefit from a log or Box-Cox transformation. The residual plots also exhibited heteroskedasticity, so it is worth exploring higher-degree polynomial terms to evaluate their impact on the regression fit. Given the different classes of riders (domestiques, sprinters, climbers, general classification, and time trial and one-day race specialists), there is likely a classification component to points prediction. Non-linear models such as random forests and gradient-boosted trees may handle these distinct groups better while also avoiding issues arising from violations of OLS assumptions. Incorporating some of the unused features above, and new ones such as watts per kilogram, team performance, rankings information, and data from Dopeology, would be worth considering regardless of the algorithm chosen.
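One of the transformations suggested above can be sketched with scikit-learn's TransformedTargetRegressor. The data is synthetic, and the choice of log1p/expm1 (rather than a plain log) is my assumption, made so that zero-point seasons remain valid inputs:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Right-skewed synthetic target, loosely resembling a points distribution.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = np.exp(X[:, 0] + rng.normal(0, 0.3, size=300))  # log-normal, heavily skewed

# Fit on log1p(y), predict back on the original scale via expm1;
# log1p is used because log(0) is undefined for zero-point seasons.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
preds = model.predict(X)
```

Because predictions are mapped back through expm1, they stay above -1 by construction, and RMSE on the original scale remains directly comparable to the untransformed models.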