/tdi-project

Capstone Project for The Data Incubator

MIT License

The Data Incubator Capstone Project

My proposed project is to analyze trends in the English, Spanish and Italian soccer leagues and build models that accurately predict which teams will win these leagues.

I am a big soccer fan, and every weekend I religiously watch the games, cheering on my favorite team. Like any sports buff, I also follow the 'sports media', and I have observed that the predictions made by different pundits often fail. Teams in supposed disarray often end up winning the league, while 'sure bet' teams often falter. The sports media also routinely hypes up certain games and, in my opinion, underweights the relative importance of certain other matches. This is similar to other arenas such as politics, where websites like FiveThirtyEight have done a good job of building polling models that predict elections more accurately than pundits. My eventual goal is to build such models and accurately weigh both the results of each game and the underlying sentiment surrounding the team, which might affect morale. I also want to understand the temporal evolution of these leagues in terms of competitiveness, offensiveness (goals scored) and defensiveness (goals conceded), and to correlate these measures with performance in the Champions League (an elite European competition of the best teams across all leagues).

I envision two separate but complementary parts of this analysis. The first part involves studying the league tables over time. For a like-for-like comparison I will study league tables from 1991, since that is when the English Premier League came into existence. I will analyze trends in games won by champions, goals scored, home wins and away wins, along with performance against the other top teams. I will compare the data for the champions with data from the teams finishing 2nd, 3rd and 4th to learn what separates the champion from the other good sides in the league. I envision using decision trees or random forest models to eventually build a tool capable of predicting league position using win ratio and other parameters. Next, I will correlate the strength of the league (number of points won by teams in the Champions League) with the performance of the league's teams in the Champions League. I have already started this phase of my analysis and discovered two interesting things:

  1. Wins against 5 mid-table teams are more predictive of a champion than wins against 5 top teams

  2. The EPL is becoming more competitive once again

During TDI, I will complete this analysis and build a model which predicts expected league position based on current results.
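As a rough illustration of the inputs such a model could use, the sketch below turns raw match results into per-team features (win ratio, home and away wins, goals scored and conceded) and a league ranking. It is a minimal sketch, not the project's code: the names `TeamRecord`, `build_table` and `rank_teams`, the tuple layout of a result, and the sample results are all hypothetical. It also assumes three points for a win and a points, goal difference, goals scored tie-break, which may not match every league and season in the study period.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TeamRecord:
    """Running totals for one team (hypothetical feature container)."""
    played: int = 0
    wins: int = 0
    draws: int = 0
    home_wins: int = 0
    away_wins: int = 0
    goals_for: int = 0
    goals_against: int = 0

    @property
    def points(self) -> int:
        # Assumption: 3 points per win, 1 per draw.
        return 3 * self.wins + self.draws

    @property
    def win_ratio(self) -> float:
        return self.wins / self.played if self.played else 0.0


def build_table(results):
    """Aggregate (home_team, away_team, home_goals, away_goals) tuples."""
    table = defaultdict(TeamRecord)
    for home, away, home_goals, away_goals in results:
        h, a = table[home], table[away]
        h.played += 1
        a.played += 1
        h.goals_for += home_goals
        h.goals_against += away_goals
        a.goals_for += away_goals
        a.goals_against += home_goals
        if home_goals > away_goals:
            h.wins += 1
            h.home_wins += 1
        elif away_goals > home_goals:
            a.wins += 1
            a.away_wins += 1
        else:
            h.draws += 1
            a.draws += 1
    return dict(table)


def rank_teams(table):
    """Order teams by points, then goal difference, then goals scored."""
    def sort_key(team):
        rec = table[team]
        return (rec.points, rec.goals_for - rec.goals_against, rec.goals_for)
    return sorted(table, key=sort_key, reverse=True)


# Made-up results, for illustration only.
results = [
    ("A", "B", 2, 0),
    ("B", "C", 1, 1),
    ("C", "A", 0, 1),
    ("B", "A", 3, 2),
]
table = build_table(results)
ranking = rank_teams(table)
print(ranking)
```

Rows built this way (one per team per season, with final position as the label) are the kind of table a decision tree or random forest could be trained on. The model itself is not shown here.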

The second part of the analysis revolves around understanding the sentiment around each team. I plan to do this by analyzing tweets about each team in pre-season and after every game week, for the seasons from 2008 to 2016. Teams transfer players during pre-season, and this often increases or decreases the hype around them, which affects how pundits in the media rate their chances. Each tweet about every team will be tokenized and stemmed, and AFINN will be used to assign a sentiment score to every team. My plan is to correlate the pre-season sentiment about each team with its eventual league position and learn how predictive it is. I will perform a similar analysis after each game week to see how the sentiment changes over the course of the season.
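The tokenize, stem and score pipeline described above might look roughly like the sketch below. Several things here are assumptions made for illustration: the lexicon is a tiny hand-made stand-in in AFINN's style (integer valences between -5 and +5), not the real AFINN word list; `crude_stem` is a naive suffix stripper standing in for a proper stemmer; and the function names and sample tweets are hypothetical.

```python
import re
from collections import defaultdict

# Toy lexicon keyed by stem. The entries and values are placeholders in
# the style of AFINN, NOT the actual AFINN word list.
TOY_LEXICON = {
    "great": 3,
    "win": 4,
    "excit": 3,
    "injur": -2,
    "disaster": -2,
    "awful": -3,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def crude_stem(token):
    """Strip one common suffix; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def score_tweet(text, lexicon=TOY_LEXICON):
    """Sum the valence of every stemmed token found in the lexicon."""
    return sum(lexicon.get(crude_stem(tok), 0) for tok in tokenize(text))


def team_sentiment(tweets, lexicon=TOY_LEXICON):
    """Average tweet score per team from (team, text) pairs."""
    scores = defaultdict(list)
    for team, text in tweets:
        scores[team].append(score_tweet(text, lexicon))
    return {team: sum(vals) / len(vals) for team, vals in scores.items()}


# Made-up tweets, for illustration only.
tweets = [
    ("Team A", "Great signing, exciting season ahead!"),
    ("Team A", "Two key players injured already"),
    ("Team B", "Awful window, this is a disaster"),
]
team_scores = team_sentiment(tweets)
print(team_scores)
```

Averaging per team is one simple way to aggregate; per-team scores computed like this for the pre-season window could then be compared against final league positions, for example with a rank correlation.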

I will combine all these results in an app that lets a user input the sentiment around a team and its game results, and see in real time that team's chances of winning the league.