- Background
- Problem Statement
- Data Sources
- Executive Summary
- Data Dictionary
- Recommendations
- Conclusion
Horse racing has a long and distinguished history, practised in civilisations across the world since ancient times. Modern horse racing became well established in 18th-century Britain. It has continued to grow in popularity to this day, and was one of the few sports that continued during the Covid-19 crisis in Australia and Hong Kong.
Horse racing is the most important sport in Hong Kong. With only 24 trainers and a similar number of jockeys, participants are firmly in the spotlight. The regulation and governance of the horse racing industry come under the supervision of the Hong Kong Jockey Club (HKJC).
Punters have access to information on the HKJC website, with veterinary and trackwork records available at the click of a button. Many factors could affect a race result, and millions have tried to find a winning formula in order to make a profit from betting. We ran a simple test to see what would happen if we bet $1 on the horse with the lowest odds in every race. In theory this should work, since the horse with the lowest odds is the market favourite. Our results showed otherwise.
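The lowest-odds test can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: it uses the `race_id`, `win_odds` and `finishing_position` columns from the data dictionary on a tiny made-up table, and it assumes that `win_odds` is the total payout per $1 staked, so a winning bet nets `win_odds - 1`.

```python
import pandas as pd

def backtest_favourite(df: pd.DataFrame, stake: float = 1.0) -> float:
    """Net profit from staking `stake` on the lowest-odds horse in every race."""
    # Row label of the lowest-odds runner per race (first occurrence on ties).
    fav_idx = df.groupby("race_id")["win_odds"].idxmin()
    favourites = df.loc[fav_idx]
    won = favourites["finishing_position"] == 1
    # Assumption: win_odds is the gross payout multiplier, so profit = odds - 1.
    profit = won * (favourites["win_odds"] - 1.0) * stake - (~won) * stake
    return float(profit.sum())

# Toy data for illustration only; not taken from the Kaggle dataset.
toy = pd.DataFrame({
    "race_id": [1, 1, 1, 2, 2, 2],
    "win_odds": [2.5, 4.0, 10.0, 1.8, 6.0, 12.0],
    "finishing_position": [1, 2, 3, 2, 1, 3],
})
print(backtest_favourite(toy))
```

In the toy table the favourite wins race 1 and loses race 2, so a strategy that is right half the time can still end up with only a thin margin once the short odds are taken into account.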
And so the problem we want to address here is:
Can we use machine learning to make predictions to profit from horse races?
We will follow the data science process to answer this problem.
- Define the problem
- Gather & clean the data
- Explore the data
- Model the data
- Evaluate the model
- Answer the problem
The dataset contains the race results of 1561 local races from the Hong Kong racing seasons 2014-17. It can be downloaded from Kaggle at this link.
The data dictionary will be provided at the bottom of this file.
INTRODUCTION
This project seeks to predict the outcome of horse races using both classification and regression models. With the classification models, we aim to predict the winner and the top 3 positions of a race. With the regression models, we aim to predict the finish time of each horse, thereby predicting the winner of the race.
With the prediction results, we will make bets using different strategies to profit from the horse race. Backtesting results of each model will also show the number of bets and profit made from each strategy.
METHODOLOGY
The work was done in 7 separate notebooks.
- Preprocessing - Cleaning and tidying of data, Feature Engineering
- EDA - Data visualisation and analysis of key patterns
- Classification Modelling - Training dataset fitted on 4 models to get classification predictions
- Regression Modelling - Testing dataset fitted on 4 models to get regression predictions
- Evaluation - Evaluation of results, Feature Importance, SHAP values
- Backtesting - Using betting strategy to answer the problem statement of whether we can profit from horse races
- Deployment - To build an application using Streamlit, where punters can key in simplified inputs to get a prediction on whether to bet on a horse.
The application was deployed on Streamlit and can be accessed through this link. A screenshot of the app is shown below. Please note that this app is intended only as an educational and demo tool, and is not meant to be used for real-life betting.
SIGNIFICANT FINDINGS
Classification models were evaluated on F1, PR-AUC, precision and recall, because the dataset is highly imbalanced when the positive class is the top position (typically only one runner per race). There is a trade-off between precision and recall; because each positive prediction triggers a bet, precision is the more important consideration.
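To make the trade-off concrete, the sketch below computes precision, recall and F1 by hand for a toy set of winner predictions. The labels are invented for illustration; in practice a library such as scikit-learn provides equivalent metric functions.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = horse finished first)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Three toy races of four runners each; exactly one winner per race.
y_true = [1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 0, 1]
# A cautious model that flags a winner in the first race only.
y_pred = [1, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0]

p, r, f = precision_recall_f1(y_true, y_pred)
print(p, r, f)
```

The cautious model places one bet and wins it, so its precision is perfect even though it misses two of the three winners. For betting purposes that is a better profile than a model with high recall that flags many losers.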
For regression models, the metric used was the root mean squared error (RMSE). Models were trained to generalise. Taking the horse with the fastest predicted time in each race as the predicted winner allowed us to compute the accuracy (or precision) of predicting the top position and the top 3 positions.
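One way to turn predicted finish times into a winner prediction is sketched below. It is an assumption about the procedure, not the notebook's code: `predicted_time` is a hypothetical field holding a regression model's output, the toy rows are invented, and "top 3" is read here as "the predicted fastest horse actually finished in the first three".

```python
from collections import defaultdict

def top_pick_hit_rates(rows):
    """Share of races where the predicted fastest horse finished 1st / in the top 3."""
    by_race = defaultdict(list)
    for row in rows:
        by_race[row["race_id"]].append(row)

    wins = places = 0
    for runners in by_race.values():
        # The runner with the smallest predicted time is the model's pick.
        pick = min(runners, key=lambda r: r["predicted_time"])
        wins += pick["finishing_position"] == 1
        places += pick["finishing_position"] <= 3
    n_races = len(by_race)
    return wins / n_races, places / n_races

# Toy predictions (seconds); values are made up.
toy_rows = [
    {"race_id": 1, "predicted_time": 82.1, "finishing_position": 1},
    {"race_id": 1, "predicted_time": 82.9, "finishing_position": 2},
    {"race_id": 2, "predicted_time": 70.4, "finishing_position": 3},
    {"race_id": 2, "predicted_time": 70.8, "finishing_position": 1},
]
print(top_pick_hit_rates(toy_rows))
```

Because exactly one horse is picked per race, the hit rate for the top position is both the accuracy over races and the precision of the "predicted winner" label.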
The SHAP summary plot showed that lower values of a horse's recent ranks contributed to a higher probability of the horse finishing top. The quality of the jockey, as shown by their recent ranks, also plays a big role in determining whether a horse will win.
In the backtesting phase, we ran our model predictions through different betting strategies, and found that 7 out of 8 models returned a profit. In the notebook, we ran a few strategies, but the simplest ones were:
- Bet $1 whenever a classification model predicted that a horse would finish in the top position
- Bet $1 on the horse for which a regression model predicted the fastest time in each race
A summary of the backtesting results for these two strategies is shown below.
Model | Profit ($) | Bets Made |
---|---|---|
SMOTE + RF | 375.2 | 743 |
Random Forest Classifier | 268.1 | 68 |
Logistic Regression | 23.0 | 32 |
Gaussian Naive Bayes | 10.7 | 177 |
Ridge Regression | 360.4 | 480 |
LGBM | 307.3 | 617 |
KNN Regression | 237.6 | 480 |
Random Forest Regressor | -48.1 | 542 |
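The two figures reported per model in the table, money returned and number of bets, can be produced by a settlement routine along the lines of the sketch below. It is an illustration under stated assumptions: `bet` is a hypothetical flag set by whichever strategy is being tested, the rows are invented, and `win_odds` is taken to be the total payout per $1 staked.

```python
def settle_bets(rows, stake=1.0):
    """Return (net profit, number of bets) for the rows flagged with bet=True."""
    profit, n_bets = 0.0, 0
    for row in rows:
        if not row["bet"]:
            continue
        n_bets += 1
        if row["finishing_position"] == 1:
            # Assumption: payout includes the stake, so the gain is odds - 1.
            profit += stake * (row["win_odds"] - 1.0)
        else:
            profit -= stake
    return profit, n_bets

# Toy rows; `bet` would come from a classifier's positive predictions
# or from the fastest predicted time in each race.
toy_bets = [
    {"bet": True,  "win_odds": 5.0,  "finishing_position": 1},
    {"bet": True,  "win_odds": 3.0,  "finishing_position": 4},
    {"bet": False, "win_odds": 12.0, "finishing_position": 1},
    {"bet": True,  "win_odds": 8.0,  "finishing_position": 2},
]
print(settle_bets(toy_bets))
```

Keeping the strategy (which rows get `bet=True`) separate from the settlement makes it easy to compare models on the same footing.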
There are two datasets obtained from Kaggle, courtesy of the Hong Kong Jockey Club. The first relates to the horses and the second relates to the races. The two tables can be joined on the race_id column.
Columns | Description |
---|---|
finishing_position | The rank of the horse. (E.g. the horse with finishing_position 1 is the first to finish) |
horse_number | The number for the horse in the specific race. (Note that the same horse may have different numbers in different races) |
horse_name | English name of the horse |
horse_id | ID of the horse. (The ID for a horse is unique in all the races) |
jockey | The one who rides the horse in the race. (A jockey can ride different horses in the races) |
trainer | The one who trains the horse. (Multiple horses from a trainer can appear in the same race) |
actual_weight | The extra weight that a horse carries in the race. (The horses with better performances in the previous races will carry extra weights to make the race more competitive) |
declared_horse_weight | The weight of the horse on the date of the race. |
draw | The position of the horse at the starting point. The inner positions are usually advantageous and correspond to smaller draw numbers. |
length_behind_winner | The length behind the winner at the finish line. The unit is “horse length”. |
running_position_i | The rank of the horse at the i-th timing point. (The running position will be “NA” if the total distance of the race is short and the horses do not cross the particular timing point) |
finish_time | The total time from the starting point to the finish line. The unit is in seconds. |
win_odds | The multiplier applied to the amount you bet to give the payout if the horse wins. The odds are usually determined automatically by the total money bet on each horse. |
race_id | The ID of the race for this entry. The race_id is consistent in the two data files. |
race_distance | The race distance in metres for each race |
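As a sketch of how the horse-level and race-level tables can be joined on `race_id`, the snippet below uses pandas with two small stand-in frames. The `race_id` and `horse_id` values are invented and the real files contain more columns; only the column names come from the data dictionary.

```python
import pandas as pd

# Toy stand-ins for the two Kaggle files.
horses = pd.DataFrame({
    "race_id": ["2014-001", "2014-001", "2014-002"],
    "horse_id": ["K019", "S070", "K019"],
    "finishing_position": [1, 2, 3],
})
races = pd.DataFrame({
    "race_id": ["2014-001", "2014-002"],
    "race_distance": [1400, 1200],
})

# Each race appears once in `races` and many times in `horses`;
# validate= raises an error if that many-to-one assumption is violated.
merged = horses.merge(races, on="race_id", how="left", validate="many_to_one")
print(merged)
```

A left join keeps every horse row even if a race record is missing, which makes gaps visible as missing values instead of silently dropping runners.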
- Backtesting results were good, but I cannot be sure if they are reflective of real life horse racing.
- One drawback of the backtesting is that all races were treated as if they started from the same information. In reality, the results of each race would have to be fed back into the model, and the model retrained, before predicting the next race. Due to time constraints, we simplified the problem and did not retrain the model multiple times.
- I treated one row of data as one sample. I perhaps should have treated all rows of a unique race as one sample as we want to see which horse can win relative to its race opponents.
- I am unable to make a prediction if the horse is new and has not raced before.
- Try out more complex models to see how they fare.
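One simple step towards treating a race, not a row, as the unit of prediction is to rescale each horse's predicted win probability against the other runners in the same race. The sketch below is only an illustration of that idea: `win_prob` is a hypothetical per-row classifier output, `race_prob` is a name introduced here, and the numbers are invented. It assumes every race has a positive total score.

```python
from collections import defaultdict

def normalise_within_race(rows):
    """Rescale per-horse scores so that they sum to 1 inside each race."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["race_id"]] += row["win_prob"]
    return [
        {**row, "race_prob": row["win_prob"] / totals[row["race_id"]]}
        for row in rows
    ]

# Toy per-row probabilities from a hypothetical classifier.
toy_probs = [
    {"race_id": 1, "horse_id": "A", "win_prob": 0.25},
    {"race_id": 1, "horse_id": "B", "win_prob": 0.125},
    {"race_id": 1, "horse_id": "C", "win_prob": 0.125},
    {"race_id": 2, "horse_id": "D", "win_prob": 0.5},
    {"race_id": 2, "horse_id": "E", "win_prob": 0.5},
]
for row in normalise_within_race(toy_probs):
    print(row["race_id"], row["horse_id"], row["race_prob"])
```

After rescaling, a horse's score reflects how it compares with its opponents, so a moderately rated horse in a weak field can rank above a well rated horse in a strong one.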
Overall, we were able to get a good result from the models and backtesting. Most of the models and strategies, though simplistic, allowed us to "profit" over the course of 2,000 races. I am convinced that using one of these statistical models would give us an edge over the average punter, but of course I would have to test this out in real life to prove it!