The Beautiful Game. Investigating European Football Database

Abstract:

This football dataset was obtained from Kaggle. It covers more than 25,000 matches and more than 10,000 players across 11 European countries, and includes team squad formations with (X, Y) coordinates and detailed match events such as goal types, possession, fouls, and cards. The dataset spans the 2008 to 2016 seasons and comes as an SQLite database with 7 tables (Country, League, Match, Player, Player_Attributes, Team, and Team_Attributes) and 199 columns in total. We will extract what serves our analysis and try to answer some questions. For instance: Which team improved over the period? Which teams scored the most goals? Which attributes lead a team to the most victories? We will also explore the player characteristics that dominate the game.

All thanks to Hugo Mathien for dedicating the time and effort to make this possible. Further reading and ways to improve the project can be found in Hugo's GitHub repo here.

Credit goes to Mr. Abdulelah Alnajem from the Ministry of Sport for pointing out the anomaly in the third question. It turns out that the "Penalties" column refers to a player attribute fetched from the FIFA game API, so the data does not contain enough information to answer this question. As a result, the question had to be changed.

Research Questions:

Our base research questions are limited to three. Since Exploratory Data Analysis (EDA) is an iterative cycle, we expect to redefine this base and add more questions along the way. In framing the questions, we hope to help sports analysts extract useful information, technical team managers make the most of their teams' potential, and novices better understand the game. The questions are as follows:

Which team scored the most goals from 2008 to 2016?
Which team improved the most over the time period?
~~Which player had the most penalties?~~
Which league had the maximum number of goals?
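
As a rough illustration of how the first and last questions could be approached, the sketch below runs two aggregate queries against a tiny in-memory stand-in for the database. The sample rows are invented, and the `Team` columns `team_api_id` and `team_long_name` are assumptions rather than names confirmed in this README; only the `Match` columns come from the attribute list in the Data Description.

```python
import sqlite3

# Toy in-memory stand-in for the real database. The rows are made up,
# and the Team column names are assumed for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Team (team_api_id INTEGER, team_long_name TEXT);
CREATE TABLE Match (
    league_id INTEGER, date TEXT,
    home_team_api_id INTEGER, away_team_api_id INTEGER,
    home_team_goal INTEGER, away_team_goal INTEGER
);
INSERT INTO Team VALUES (1, 'Team A'), (2, 'Team B'), (3, 'Team C');
INSERT INTO Match VALUES
    (10, '2008-08-17', 1, 2, 3, 1),
    (10, '2009-03-01', 2, 1, 0, 2),
    (20, '2010-05-09', 3, 1, 1, 1);
""")

# Stack home and away goals into one (team, goals) list, then sum per team.
GOALS_PER_TEAM = """
SELECT t.team_long_name, SUM(g.goals) AS total_goals
FROM (
    SELECT home_team_api_id AS team_api_id, home_team_goal AS goals FROM Match
    UNION ALL
    SELECT away_team_api_id, away_team_goal FROM Match
) AS g
JOIN Team AS t ON t.team_api_id = g.team_api_id
GROUP BY t.team_long_name
ORDER BY total_goals DESC, t.team_long_name
"""
goals_per_team = conn.execute(GOALS_PER_TEAM).fetchall()

# Every goal in a match counts toward the league the match belongs to.
GOALS_PER_LEAGUE = """
SELECT league_id, SUM(home_team_goal + away_team_goal) AS total_goals
FROM Match
GROUP BY league_id
ORDER BY total_goals DESC
"""
goals_per_league = conn.execute(GOALS_PER_LEAGUE).fetchall()

print(goals_per_team)
print(goals_per_league)
```

The `UNION ALL` step matters because each team appears in two different columns depending on whether it played at home or away; summing only one side would undercount.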

Data Description:

Our data is in the form of an SQLite database. This database was created by Hugo Mathien on 09/07/2016 and uploaded to Kaggle under the Database Contents License (DbCL) v1.0 by Open Data Commons.

There are more than 25,000 matches and details for more than 10,000 players, spanning the 2008 to 2016 seasons. Detailed match events (goal types, possession, corners, crosses, fouls, cards, etc.) as well as player and team attributes are also present.

The dataset is relational and includes 7 tables (199 columns in total) with heavy emphasis on boolean values. Our goal here is to consolidate the tables we need to conduct our exploration.
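
Before consolidating anything, it can help to list the tables and their columns. The following is a minimal sketch using only the standard `sqlite3` module; `describe_tables` is a hypothetical helper name, and the two toy tables (with assumed columns) stand in for the real file so the snippet runs on its own.

```python
import sqlite3

def describe_tables(conn):
    """Return {table_name: [column names]} for every user table."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name")]
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return {
        name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
        for name in names
    }

# Toy schema for illustration; the column names here are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (id INTEGER, name TEXT);
CREATE TABLE League (id INTEGER, country_id INTEGER, name TEXT);
""")

schema = describe_tables(conn)
total_columns = sum(len(cols) for cols in schema.values())
print(schema, total_columns)
```

Pointing `sqlite3.connect` at the downloaded database file instead of `":memory:"` should, if the description above holds, report 7 tables and 199 columns in total.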

Key mentioned attributes

Variable	Description
home_team_api_id	identifier for the home team, fetched from the FIFA API
away_team_api_id	identifier for the away team, fetched from the FIFA API
league_id	identifier for the league
date	date of the match
home_team_goal	number of home team goals
away_team_goal	number of away team goals
buildUpPlaySpeedClass	classification of team build-up play speed; 3 values: Balanced, Fast, Slow
buildUpPlayPassingClass	classification of team passing style; 3 values: Short, Long, Mixed
player_name	name of the player
overall_rating	player overall rating based on FIFA game
preferred_foot	player's favorite foot, right or left
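
The match attributes above are enough to derive results, which is one possible starting point for the "improved the most" question. The sketch below counts wins per team per calendar year on invented rows. It assumes `date` is stored as ISO-style text that SQLite's `strftime` can parse, and it uses calendar year as a simple proxy for season; neither is confirmed by this README.

```python
import sqlite3

# Toy Match table with made-up rows; the date format is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Match (
    date TEXT,
    home_team_api_id INTEGER, away_team_api_id INTEGER,
    home_team_goal INTEGER, away_team_goal INTEGER
);
INSERT INTO Match VALUES
    ('2008-08-17 00:00:00', 1, 2, 2, 0),
    ('2008-09-01 00:00:00', 2, 1, 1, 1),
    ('2016-04-10 00:00:00', 1, 2, 0, 3),
    ('2016-05-01 00:00:00', 2, 1, 2, 1);
""")

# One row per team per match: a comparison in SQLite evaluates to 1 or 0,
# so SUM(won) counts wins and COUNT(*) counts matches played.
WINS_PER_YEAR = """
SELECT team_api_id, year, SUM(won) AS wins, COUNT(*) AS played
FROM (
    SELECT home_team_api_id AS team_api_id,
           strftime('%Y', date) AS year,
           home_team_goal > away_team_goal AS won
    FROM Match
    UNION ALL
    SELECT away_team_api_id,
           strftime('%Y', date),
           away_team_goal > home_team_goal
    FROM Match
) AS results
GROUP BY team_api_id, year
ORDER BY team_api_id, year
"""
wins_per_year = conn.execute(WINS_PER_YEAR).fetchall()
print(wins_per_year)
```

Comparing a team's win share (`wins / played`) between its first and last year is one simple way to rank improvement, though other definitions such as points or goal difference are equally defensible.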

Tools:

The analysis will be developed in an IPython notebook. We will use the following tools:

Python 3.7
SQLite3 -- This will allow us to establish a database connection
Pandas
NumPy
Matplotlib
Seaborn
Pandas Profiling -- generates a detailed report with an overview of the dataset, variable properties, interactions, correlations and missing values

