The goal of this project is to create an accurate machine learning model that predicts the speed rating of every finisher in any given race in New York State. The model will be trained, validated, and tested on speed ratings from 2014 to 2019, along with weather data from the National Centers for Environmental Information and course-specific information from Tully Runners and Milesplit. The hope is that this model can be applied to future high school cross country races and produce a larger number of accurate speed ratings.
The repository is organized as follows:
- data/ contains all of the raw and preprocessed data used in the project
- figures/ contains all of the figures generated in the project
- report/ contains reports on the development and results of the project
- results/ contains all of the results generated in the project (including the final trained models)
- src/ contains all the code. The code is organized as follows:
  - compile.ipynb contains code for compiling the raw data
  - preprocessing.ipynb contains code for preprocessing the compiled raw data
  - eda.ipynb contains code for exploratory data analysis
  - elasticnet_ml.ipynb contains code for training and testing the elastic net model
  - randomforest_ml.ipynb contains code for training and testing the random forest model
  - xgb_ml.ipynb contains code for training and testing the xgboost model
  - knn_ml.ipynb contains code for training and testing the k-nearest neighbors model
  - evaluation.ipynb contains code for evaluating the final models
This project is built in Python 3.11.4 and uses the following key packages:
- joblib 1.3.2
- matplotlib 3.7.2
- numpy 1.24.4
- pandas 2.0.3
- scikit-learn 1.3.0
- scipy 1.11.3
- seaborn 0.12.2
- shap 0.42.1
- xgboost 1.7.6
The environment.yml file contains a full list of the packages for this project.
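To confirm that an existing Python environment matches the key package versions listed above, a small check along the following lines may help. This is only a sketch and not part of the repository: the `check_versions` helper and the `EXPECTED` dictionary are hypothetical names introduced here, and the check relies on the standard library's `importlib.metadata` rather than on anything in environment.yml.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the key package list in this README.
EXPECTED = {
    "joblib": "1.3.2",
    "matplotlib": "3.7.2",
    "numpy": "1.24.4",
    "pandas": "2.0.3",
    "scikit-learn": "1.3.0",
    "scipy": "1.11.3",
    "seaborn": "0.12.2",
    "shap": "0.42.1",
    "xgboost": "1.7.6",
}


def check_versions(expected, lookup=version):
    """Return {package: (expected, installed_or_None)} for every mismatch.

    `lookup` defaults to importlib.metadata.version; it is a parameter only
    so the function can be exercised without the packages installed.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    problems = check_versions(EXPECTED)
    for name, (wanted, found) in problems.items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
    if not problems:
        print("All key packages match the listed versions.")
```

An empty result means every listed package is installed at the stated version. Any other result lists the packages that differ or are missing, which may or may not matter depending on the notebook being run.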