Horse Race Prediction

Background
Problem Statement
Data Sources
Executive Summary
Data Dictionary
Recommendations
Conclusion

Background

Horse racing has a long and distinguished history, practised in civilisations across the world since ancient times. Modern horse racing became well established in Britain in the 18th century. It has continued to grow in popularity to this day, and was one of the few sports that continued during the Covid-19 crisis in Australia and Hong Kong.

Horse racing is the most important sport in Hong Kong. With only 24 trainers and a similar number of jockeys, participants are firmly in the spotlight. The regulation and governance of the horse racing industry comes under the supervision of the Hong Kong Jockey Club.

Problem Statement

Punters have access to information available on the HKJC website, with veterinary and trackwork records available at the click of a button. Many factors could affect a race result, and millions have tried to find a winning formula in order to profit from betting. We did a simple test to see what would happen if we bet $1 on the horse with the lowest odds in every race. Theoretically, this should work, since the horse with the lowest odds is the market favourite. Our results showed otherwise.
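
The favourite test described above can be sketched in a few lines of plain Python. This is only an illustration, not the notebook code: it assumes each row carries the race_id, win_odds and finishing_position columns from the data dictionary, and that win_odds is a decimal multiplier that includes the stake, so a winning $1 bet nets win_odds - 1. The function name and the sample rows are made up.

```python
from collections import defaultdict


def favourite_baseline(rows, stake=1.0):
    """Bet `stake` on the lowest-odds horse in every race.

    Returns (profit, number_of_bets). Assumes win_odds includes the stake.
    """
    # Group the runners by race.
    races = defaultdict(list)
    for row in rows:
        races[row["race_id"]].append(row)

    profit = 0.0
    for runners in races.values():
        favourite = min(runners, key=lambda r: r["win_odds"])
        if favourite["finishing_position"] == 1:
            profit += stake * (favourite["win_odds"] - 1)  # net winnings
        else:
            profit -= stake  # stake is lost
    return profit, len(races)


# Made-up rows for illustration only.
rows = [
    {"race_id": "r1", "win_odds": 2.5, "finishing_position": 1},
    {"race_id": "r1", "win_odds": 8.0, "finishing_position": 2},
    {"race_id": "r2", "win_odds": 3.0, "finishing_position": 4},
    {"race_id": "r2", "win_odds": 5.0, "finishing_position": 1},
]
print(favourite_baseline(rows))  # (0.5, 2)
```

A favourite that wins at short odds pays little, so a single losing favourite can wipe out several wins. That is one plausible reason such a baseline can lose money overall.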

And so the problem we want to address here is:

Can we use machine learning to make predictions to profit from horse races?

We will follow the data science process to answer this problem.

Define the problem
Gather & clean the data
Explore the data
Model the data
Evaluate the model
Answer the problem

Data Sources

The dataset contains the race results of 1561 local races across the Hong Kong racing seasons 2014-17. It can be downloaded from Kaggle at this link.

The data dictionary will be provided at the bottom of this file.

Executive Summary

INTRODUCTION

This project seeks to make predictions on the outcome of horse races through both classification and regression models. For classification models, we aim to predict the winner and top 3 positions of a race. For regression models, we aim to predict the finish time of the horses, thereby predicting the winner of the race.

With the prediction results, we will make bets using different strategies to profit from the horse race. Backtesting results of each model will also show the number of bets and profit made from each strategy.

METHODOLOGY

The work was done in 7 separate notebooks.

Preprocessing - Cleaning and tidying of data, Feature Engineering
EDA - Data visualisation and analysis of key patterns
Classification Modelling - Training dataset fitted on 4 models to get classification predictions
Regression Modelling - Testing dataset fitted on 4 models to get regression predictions
Evaluation - Evaluation of results, Feature Importance, SHAP values
Backtesting - Using betting strategy to answer the problem statement of whether we can profit from horse races
Deployment - To build an application using Streamlit, where punters can key in simplified inputs to get a prediction on whether to bet on a horse.

The application was deployed on Streamlit and can be accessed through this link. A screenshot of the app is shown below. Please note that this app is intended only as an educational and demo tool, and is not meant to be used for real-life betting.

SIGNIFICANT FINDINGS

Classification models were evaluated on their F1 score, PR-AUC, precision and recall, as the dataset is highly imbalanced when predicting the positive class of the top position. There is a trade-off between precision and recall; since every bet relies on a positive prediction being correct, precision is the more important consideration.
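
To make the metrics concrete, here is a minimal sketch of precision, recall and F1 computed from scratch. It is not the evaluation code from the notebooks, and the labels are made up: two races of five runners each, with 1 marking the winner.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correct win calls
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # losing bets
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed winners

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels: one winner per five-runner race.
y_true = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.5, 0.5, 0.5)

# A model that never predicts a winner is 80% accurate here,
# yet scores zero on all three metrics.
never_bets = [0] * len(y_true)
print(precision_recall_f1(y_true, never_bets))  # (0.0, 0.0, 0.0)
```

The second call shows why accuracy is a poor guide on imbalanced data, and why precision (the share of bets that actually win) matters most for betting.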

For regression models, the metric used was the root mean squared error (RMSE). Models were trained to generalise. Taking the horse with the fastest predicted time in each race as the predicted winner then allowed us to obtain the accuracy (or precision) of predicting the top position and top 3 positions.
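
The step from predicted finish times to a predicted winner could look roughly like the sketch below. It is an illustration under assumptions, not the notebook code: predicted_time is a hypothetical column holding the regression output, the other columns follow the data dictionary, and the sample rows are made up.

```python
import math
from collections import defaultdict


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def top1_hit_rate(rows):
    """Share of races where the fastest predicted horse actually won."""
    races = defaultdict(list)
    for row in rows:
        races[row["race_id"]].append(row)

    hits = 0
    for runners in races.values():
        pick = min(runners, key=lambda r: r["predicted_time"])
        if pick["finishing_position"] == 1:
            hits += 1
    return hits / len(races)


# Made-up rows; predicted_time is a hypothetical model output column.
rows = [
    {"race_id": "r1", "finish_time": 70.1, "predicted_time": 70.3, "finishing_position": 1},
    {"race_id": "r1", "finish_time": 70.5, "predicted_time": 70.6, "finishing_position": 2},
    {"race_id": "r2", "finish_time": 82.0, "predicted_time": 82.4, "finishing_position": 1},
    {"race_id": "r2", "finish_time": 82.3, "predicted_time": 82.1, "finishing_position": 2},
]
error = rmse([r["finish_time"] for r in rows], [r["predicted_time"] for r in rows])
print(round(error, 3), top1_hit_rate(rows))  # 0.25 0.5
```

Note that a low RMSE does not guarantee a good hit rate: in the second race the errors are small, but they are large enough to swap the order of the two horses.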

The SHAP summary plot showed that lower values of a horse's recent ranks contributed to a higher probability of the horse finishing top. The quality of the jockey, as shown by his recent ranks, also plays a big role in determining if a horse will win.

In the backtesting phase, we ran our model predictions through different strategies, and found that 7 out of 8 models returned a profit. In the notebook, we ran a few strategies, but the simplest ones were:

Bet $1 whenever the model predicted that a horse would take the top position
Bet $1 on the horse that the model predicted would have the fastest time in a race
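
As a rough sketch of how the first strategy could be settled, the function below walks over the rows, bets on every horse the classifier flags, and tracks the running total and the number of bets. It is not the backtesting notebook itself. The predicted_win column is a hypothetical name for the classifier output, win_odds is assumed to include the stake, and the rows are made up.

```python
def backtest_predicted_winners(rows, stake=1.0):
    """Bet `stake` on every horse flagged as a winner by the classifier.

    Returns (money, bets_made). Assumes win_odds includes the stake.
    """
    money, bets_made = 0.0, 0
    for row in rows:
        if row["predicted_win"] != 1:
            continue  # no bet on this horse
        bets_made += 1
        if row["finishing_position"] == 1:
            money += stake * (row["win_odds"] - 1)  # net winnings
        else:
            money -= stake  # stake is lost
    return money, bets_made


# Made-up rows; predicted_win is a hypothetical classifier output column.
rows = [
    {"race_id": "r1", "predicted_win": 1, "win_odds": 4.0, "finishing_position": 1},
    {"race_id": "r1", "predicted_win": 0, "win_odds": 2.0, "finishing_position": 2},
    {"race_id": "r2", "predicted_win": 1, "win_odds": 6.0, "finishing_position": 3},
    {"race_id": "r2", "predicted_win": 0, "win_odds": 3.5, "finishing_position": 1},
]
print(backtest_predicted_winners(rows))  # (2.0, 2)
```

The second strategy would settle bets in the same way. Only the selection step changes: exactly one horse is picked per race, the one with the lowest predicted finish time.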

A summary of the backtesting results for these two strategies is shown below.

Model	Profit ($)	Bets Made
SMOTE + RF	375.2	743
Random Forest Classifier	268.1	68
Logistic Regression	23.0	32
Gaussian Naive Bayes	10.7	177
Ridge Regression	360.4	480
LGBM	307.3	617
KNN Regression	237.6	480
Random Forest Regressor	-48.1	542

Data Dictionary

There are two datasets obtained from Kaggle, courtesy of the Hong Kong Jockey Club. The first is related to the horse and the second is related to the race. The two tables can be joined on the race_id column.

Columns	Description
finishing_position	The rank of the horse. (E.g. the horse with finishing_position 1 is the first to finish)
horse_number	The number for the horse in the specific race. (Note that the same horse may have different numbers in different races)
horse_name	English name of the horse
horse_id	ID of the horse. (The ID for a horse is unique in all the races)
jockey	The one who rides the horse in the race. (A jockey can ride different horses in the races)
trainer	The one who trains the horse. (Multiple horses from a trainer can appear in the same race)
actual_weight	The extra weight that a horse carries in the race. (The horses with better performances in the previous races will carry extra weights to make the race more competitive)
declared_horse_weight	The weight of the horse on the date of the race.
draw	The position of the horse at the starting point. The inner positions are usually advantageous and correspond to smaller draw numbers.
length_behind_winner	The length behind the winner at the finish line. The unit is “horse length”.
running_position_i	The rank of the horse at the i-th timing point. (The running position will be “NA” if the total distance of the race is short and the horses do not cross the particular timing point)
finish_time	The total time from the starting point to the finish line. The unit is in seconds.
win_odds	The multiplier applied to the amount you bet to give the amount received if the horse wins. The odds are usually determined automatically by the total money bet on each horse.
race_id	The ID of the race for this entry. The race_id is consistent in the two data files.
race_distance	The race distance in metres for each race
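
The join on race_id can be illustrated with a small sketch in plain Python. With pandas it would typically be a single merge on race_id. The sample rows and their values below are made up, and the function name is only for illustration.

```python
def join_on_race_id(horse_rows, race_rows):
    """Inner join: attach race-level columns to each horse-level row."""
    races_by_id = {race["race_id"]: race for race in race_rows}
    joined = []
    for horse in horse_rows:
        race = races_by_id.get(horse["race_id"])
        if race is None:
            continue  # drop horse rows that have no matching race
        joined.append({**race, **horse})
    return joined


# Made-up rows for illustration only.
horse_rows = [
    {"race_id": "r1", "horse_id": "h1", "draw": 3, "win_odds": 4.2},
    {"race_id": "r1", "horse_id": "h2", "draw": 7, "win_odds": 9.5},
    {"race_id": "r9", "horse_id": "h3", "draw": 1, "win_odds": 2.1},
]
race_rows = [{"race_id": "r1", "race_distance": 1200}]

joined = join_on_race_id(horse_rows, race_rows)
print(len(joined), joined[0]["race_distance"])  # 2 1200
```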

Recommendations

Backtesting results were good, but I cannot be sure if they are reflective of real-life horse racing.
One drawback of the backtesting is that all races were treated as if they started from the same state of knowledge. In reality, the results of each race would have to be fed back into the model, and the model retrained, before predicting the next race. Due to time constraints, we simplified the problem and did not retrain the model multiple times.
I treated one row of data as one sample. Perhaps I should have treated all rows of a single race as one sample, since we want to see which horse can win relative to its opponents in that race.
I am unable to make a prediction if the horse is new and has not raced before.
Try out more complicated models to see how they fare?
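
The retraining loop mentioned in the second point could be organised as a walk-forward split: train on every race seen so far, predict the next one, then add its result to the training set. The sketch below only generates the splits. It assumes the race IDs are already in chronological order, and the function name and min_train parameter are hypothetical.

```python
def walk_forward_splits(race_ids, min_train=2):
    """Yield (train_ids, test_id) pairs in chronological order.

    Each race is predicted using only the races that came before it.
    Assumes race_ids is already sorted by race date.
    """
    ordered = list(race_ids)
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


splits = list(walk_forward_splits(["r1", "r2", "r3", "r4"]))
for train_ids, test_id in splits:
    print(train_ids, "->", test_id)
# ['r1', 'r2'] -> r3
# ['r1', 'r2', 'r3'] -> r4
```

Retraining once per race is expensive, so a real run might retrain once per race meeting or per week instead.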

Conclusion

Overall, we were able to get a good result from the models and backtesting. Most of the models and strategies, though simplistic, allowed us to "profit" over the course of 2,000 races. I am convinced that using one of these statistical models would give us an edge over the average punter, but of course I would have to test this out in real life to prove it!

ethan-eplee/HorseRacePrediction