Web scraping, machine learning, data viz

Odds-Are-Odd

Are sufficiently low odds a trustworthy predictor of a home win? Should I never go with high odds? What is the balance between risk and profit in soccer betting?

soccer betting

To address these questions, odds-movement data for Season 2018/2019 of the top five European leagues (Premier League from England, La Liga from Spain, Bundesliga from Germany, Serie A from Italy, and Ligue1 from France), as well as for the ongoing Season 2019 of MLS (U.S.), were scraped from 310win.com using Selenium. For each league, one randomly selected week of matches was set aside for model testing, and all other weeks were used to train the machine learning models. Match info was translated back to English during data processing.
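
The per-league hold-out described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `split_by_week`, the fixed seed, and the toy rows are assumptions made for the example.

```python
import random


def split_by_week(rows, seed=0):
    """Hold out one randomly chosen match week.

    Returns (train_rows, test_rows, held_out_week).
    """
    weeks = sorted({row["week"] for row in rows})
    held_out = random.Random(seed).choice(weeks)
    train = [r for r in rows if r["week"] != held_out]
    test = [r for r in rows if r["week"] == held_out]
    return train, test, held_out


# Toy records for a single league (made up); real rows would come from
# the scraped odds data, one split per league.
rows = [
    {"week": w, "home": f"H{w}{i}", "away": f"A{w}{i}"}
    for w in range(1, 6)
    for i in range(3)
]
train_rows, test_rows, held_out_week = split_by_week(rows, seed=42)
```

Splitting by whole weeks rather than by individual rows keeps every odds record of a held-out match out of the training set.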

The scraped data include the following columns: match week ("week"), names of the home ("home") and away ("away") teams, the betting company ("company") the odds came from, the odds for the home team to win ("win_odds"), draw ("draw_odds"), and lose ("lose_odds") the match, the time between the odds record and the start of the match in minutes ("odds_delta_time"), and the match result ("result"). For training the machine learning (ML) models, all columns besides "result" were treated as features and "result" as the target. The data in the feature columns turned out not to be normally distributed, especially in terms of kurtosis.

feature data distr
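
As a rough illustration of the kurtosis check mentioned above, the sketch below computes population excess kurtosis (which is 0 for a normal distribution) in plain Python. The sample `win_odds` values are made up for the example, and the project may well have used a library routine instead.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


# Made-up odds with one long-shot value producing a heavy right tail.
win_odds = [1.5, 1.6, 1.55, 1.7, 1.65, 9.0]
heavy_tail = excess_kurtosis(win_odds)  # positive: heavier tails than normal
```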

A correlation study of the feature columns further revealed that "lose_odds" was highly correlated with "draw_odds" (and sometimes with "win_odds" as well).

feature corr
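
A correlation like the one described above can be measured with the Pearson coefficient. The sketch below implements it directly; the two short columns are invented values that only mimic the pattern of draw and lose odds rising together, not project data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented example columns: as the home side becomes a stronger favorite,
# both draw and lose odds lengthen.
draw_odds = [3.2, 3.4, 3.8, 4.5, 5.5]
lose_odds = [2.9, 3.6, 4.8, 7.0, 11.0]
r_draw_lose = pearson(draw_odds, lose_odds)
```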

The target column ("result"), on the other hand, was examined against the time distribution of the odds. Depending on the league, there was a time threshold beyond which odds records were rare. For Premier League and Bundesliga, this window period is 21 days; for La Liga, Serie A, and Ligue1, it is 14 days; and for MLS, it is only 7 days. (This somewhat echoes the predictive performance of the odds seen by the end of this project: seemingly, the more familiar the betting companies are with a league, the longer the window is and the better their odds predict match results.)

odds time distr
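
Restricting the data to each league's window period could look like the sketch below. It assumes "odds_delta_time" is a non-negative number of minutes before kickoff; the helper name `within_window` and the sample rows are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60

# Window periods per league, in days, as listed above.
WINDOW_DAYS = {
    "Premier League": 21,
    "Bundesliga": 21,
    "La Liga": 14,
    "Serie A": 14,
    "Ligue1": 14,
    "MLS": 7,
}


def within_window(rows, league):
    """Keep only odds records posted within the league's window period."""
    limit = WINDOW_DAYS[league] * MINUTES_PER_DAY
    return [r for r in rows if r["odds_delta_time"] <= limit]


# Made-up records: the last two are older than the 7-day MLS window.
mls_rows = [
    {"odds_delta_time": 60},
    {"odds_delta_time": 7 * MINUTES_PER_DAY},
    {"odds_delta_time": 7 * MINUTES_PER_DAY + 1},
    {"odds_delta_time": 30000},
]
recent_mls_rows = within_window(mls_rows, "MLS")
```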

Taking all the above analyses into consideration, I did not use a linear regression model in this project because of the feature data distribution. The ML models trained here were Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Neural Network (NN). The betting companies chosen were Bet 365, William Hill, and Bwin. 12Bet, headquartered in the Philippines, was also selected in order to include an Asian view. The feature count was either 4 or 3 (excluding "lose_odds"), and the target used either all odds data or only those within the window period. With that, I was able to run my first attempt at screening ML models: 6 (leagues) x 4 (betting companies) x 2 (feature sets) x 2 (targets) x 4 (models) = 384 runs.
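
The size of that screening grid can be checked by enumerating the combinations. The sketch below only builds the list of run configurations; the short labels for feature sets and targets are illustrative names, and no model is trained here.

```python
from itertools import product

leagues = ["Premier League", "La Liga", "Bundesliga", "Serie A", "Ligue1", "MLS"]
companies = ["Bet 365", "William Hill", "Bwin", "12Bet"]
feature_sets = ["with_lose_odds", "without_lose_odds"]  # 4 vs. 3 features
targets = ["all_odds", "within_window"]
models = ["RF", "SVM", "KNN", "NN"]

# One tuple per screening run: (league, company, feature_set, target, model).
runs = list(product(leagues, companies, feature_sets, targets, models))
run_count = len(runs)  # 6 * 4 * 2 * 2 * 4
```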

After more than 24 hours of running, the ML models with the best performance across all leagues were RF and KNN with 5 neighbors. 12Bet was the best betting company from which to retrieve odds-movement data for training. Removing odds that were too "old" helped model building for two thirds of the leagues. Despite its high correlation with the odds for the other match result(s), retaining "lose_odds" as a feature was beneficial.

top scored testing score

I next fine-tuned the ML models with GridSearchCV. As the analysis below shows, RF models performed slightly better than KNN in testing score as well as in precision and recall. Ideally I would use RF models for prediction. However, since the saved RF models far exceeded GitHub's maximum file size for uploads, I used the KNN model for the demo in HTML instead.

model comparison final
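
For readers unfamiliar with GridSearchCV, the sketch below shows the general shape of tuning a KNN classifier with scikit-learn. It is not the project's tuning code: the data are synthetic (four features, three classes standing in for win, draw, and lose), the parameter grid is an assumption, and the split here is random rather than by held-out match week.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the odds features and the three-way match result.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hypothetical grid; the values actually searched in the project are not stated.
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
# Per-class precision and recall on the held-out split.
report = classification_report(y_test, search.predict(X_test))
```

`search.best_params_` holds the selected combination, and `search.best_estimator_` is the refitted model that would be saved for the demo.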

An HTML page was made to 1) link to 310win.com to showcase the web scraping; 2) link to Tableau Public to demonstrate the data processing and analyses; and 3) open a new HTML page for match prediction (for MLS, it is actual match prediction; for the five European leagues, it is validation using the randomly selected week of matches from Season 2018/2019 that was set aside earlier).

League logos on the prediction page serve as form buttons that pass the league info to a Flask route. That info is then used either to load a saved .csv file or to freshly scrape odds data from designated pages on 310win.com. The retrieved data are then used to predict match results. The results are sent (render_template) back to the prediction page as a string to be retrieved by inline JavaScript. Match info is eventually processed by external static JavaScript and appended to the prediction page for display.

prediction example
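
The request flow described above might be wired up roughly as in the sketch below. Everything specific in it is an assumption made for illustration: the `/predict` URL, the `league` form field, the three helper functions (stubs standing in for the CSV loader, the Selenium scraper, and the saved KNN model), and the inline template, which uses `render_template_string` so the example fits in one file where the app itself uses `render_template`.

```python
import json

from flask import Flask, render_template_string, request

app = Flask(__name__)


def load_saved_week(league):
    """Hypothetical stub: would read the held-out week from a saved .csv."""
    return [{"home": "Team A", "away": "Team B",
             "win_odds": 1.8, "draw_odds": 3.5, "lose_odds": 4.2}]


def scrape_fresh_odds(league):
    """Hypothetical stub: would scrape current odds with Selenium."""
    return [{"home": "Team C", "away": "Team D",
             "win_odds": 2.9, "draw_odds": 3.3, "lose_odds": 2.4}]


def predict_results(matches):
    """Placeholder for the saved model: picks the outcome with the lowest odds."""
    labels = {"win_odds": "win", "draw_odds": "draw", "lose_odds": "lose"}
    for match in matches:
        match["prediction"] = labels[min(labels, key=lambda k: match[k])]
    return matches


# Inline script receives the predictions as a JSON string.
PAGE = "<script>var predictions = JSON.parse({{ payload|tojson }});</script>"


@app.route("/predict", methods=["POST"])
def predict():
    league = request.form["league"]
    # MLS is predicted from freshly scraped odds; the European leagues are
    # validated against the saved held-out week.
    matches = scrape_fresh_odds(league) if league == "MLS" else load_saved_week(league)
    payload = json.dumps(predict_results(matches))
    return render_template_string(PAGE, payload=payload)
```

Each league logo would then sit inside a form that posts its league name to this route.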

The app has been deployed on Heroku. Note that Heroku by default prevents the installation of the custom software needed to run the browser that Selenium expects to exist (see detailed discussion). As a result, clicking the MLS logo on the "prediction" page gives a 500 Internal Server Error instead of triggering Selenium-driven data scraping.