/premier_league_analysis

Analysis of the Premier League games and seasons since 1992.

Project Intro/Objective

This repo contains an analysis of the Premier League. I wanted to understand more about the history of the league, so I collected all the data about the games and players, cleaned it, analyzed it, and wrote an article about it on Medium.

Project Description

This project is split into three sections (i.e., three Jupyter notebooks):

  • Data collection: I used web scraping to collect all the data about the games, the players, and the events that happened in every game (goals, red/yellow cards, substitutions, etc.). The official Premier League website was the main data source; after navigating through it, I understood how the data was displayed and the best way to get it:

    • First, collect the ID of each season from a dropdown menu on the games' pages.
    • Loop through each season's page (https://www.premierleague.com/.../{season_id}) to collect each game's ID.
    • Loop through each game's page (https://www.premierleague.com/.../{match_id}) to collect each game's data. Luckily for me, the website stores the game's data (far more than I needed) as JSON readable in the HTML. I just had to flatten the JSON into tabular data, which I split into multiple files (games, events, players).
  • Data cleaning: There was far too much information in those JSONs, so I removed some columns, reformatted others, dealt with missing values, and generally cleaned the data into a format suitable for the analysis (EDA).

  • Data Analysis & Visualization: Definitely the most exciting and fun part to read! I analyzed the data to find interesting facts about the league. Not every insight has a chart, but the charts listed further down are interactive, since they are all made with Plotly!
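
As a rough illustration of the flattening and splitting step described under data collection, here is a minimal sketch. It assumes a nested match payload with an `id` field and an `events` list; those field names, the `flatten` and `split_match` helpers, and the sample data are all hypothetical and are not taken from the notebooks or from the Premier League website.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = "_") -> dict[str, Any]:
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, name, sep))
        else:
            # Scalars (and lists) are kept as-is under the joined key.
            flat[name] = value
    return flat


def split_match(match: dict[str, Any]) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Return one flat game row plus one flat row per event."""
    match_id = match["id"]
    # Everything except the event list becomes the single "games" row.
    game = flatten({k: v for k, v in match.items() if k != "events"})
    # Each event gets the match ID so the two tables can be joined later.
    events = [{"match_id": match_id, **flatten(e)} for e in match.get("events", [])]
    return game, events


# Hypothetical payload: the field names are illustrative only.
sample = {
    "id": 1,
    "teams": {"home": "A", "away": "B"},
    "events": [{"type": "goal", "clock": {"minute": 12}}],
}
game_row, event_rows = split_match(sample)
```

Rows shaped like `game_row` and `event_rows` could then be accumulated across matches and written to separate files, one per table.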

Notes

  • I couldn't manage to make the charts of the EDA notebook interactive on GitHub, so the only way to play with them is to run the EDA notebook or to check out some of them below.
  • The EDA has a lot of insights that don't have charts, so go take a look!

Top 10 nicest charts and their links for interactivity with Plotly

If you don't want to run EDA.ipynb, check the charts below or in docs/images. You can open them by putting http://htmlpreview.github.io/? in front of http in the URL.
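
The prefixing trick can also be scripted. The helper below is only a sketch: `preview_url` is a hypothetical name, and the example URL is a placeholder rather than a real file in this repo.

```python
PREVIEW_PREFIX = "http://htmlpreview.github.io/?"


def preview_url(chart_url: str) -> str:
    """Put the htmlpreview prefix in front of a chart's URL."""
    if chart_url.startswith(PREVIEW_PREFIX):
        # Already prefixed, so leave it untouched.
        return chart_url
    return PREVIEW_PREFIX + chart_url


# Placeholder URL for illustration only.
print(preview_url("https://example.com/docs/images/chart.html"))
```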

Link for interactivity: Distribution_of_events_over_the_minutes_of_the_games

Link for interactivity: Goals_vs_assists_per_season

Link for interactivity: Minutes_played_vs_goals_scores_vs_assists_vs_player's_position

Link for interactivity: Distribution_of_points_big_6_vs_other_teams

Link for interactivity: Number_of_time_each_team_has_finished_at_which_place_of_the_podium

Link for interactivity: Origin_of_the_players

Link for interactivity: Points_of_difference_with_the_team_ranked_after_itself

Link for interactivity: Ranking_of_the_Big_6_per_year

Link for interactivity: Ratio_Win,Draw,Lost_in%-_All_seasons_included

Link for interactivity: Cleansheets_per_season_(at_leat_10)