Webscrape English Premier League player statistics from 2006-2018
I built the entire data set by web scraping. All methods and properties are encapsulated in the PlayerScrapper class.
The pipeline runs as follows:
1. This step provided the names and URLs of all the clubs competing in the EPL in a given year.
   - Get Club List HTML
     - Used Selenium with ChromeDriver because the dropdown bar would not update when given a specific URL.
     - Saved HTML to "data/epl/epl_clubs/year/year_epl_clubs.html"
     - Example of webpage: Club List
   - Parse Club List HTML
     - Used BeautifulSoup to extract club/URL key/value pairs from the local HTML file
     - Saved this information as a dictionary in a class variable to be accessed later
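
The fetch-then-parse pattern in this step could be sketched roughly as follows. This is a minimal illustration, not the PlayerScrapper code: the function names, the `a.club` selector, and the sample markup are assumptions, since the site's actual HTML is not shown here. Selenium is imported inside the function so the parsing half can be tried without a browser.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def club_list_path(year: int, root: str = "data/epl") -> Path:
    """Build the local path described above for one year's club-list page."""
    return Path(root) / "epl_clubs" / str(year) / f"{year}_epl_clubs.html"


def save_club_list_html(url: str, year: int) -> Path:
    """Render the page in Chrome and save the resulting HTML to disk."""
    from selenium import webdriver  # lazy import; needs Chrome and ChromeDriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Choosing the year in the dropdown would happen here. The element
        # locators are not given in this README, so that part is left out.
        path = club_list_path(year)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(driver.page_source, encoding="utf-8")
        return path
    finally:
        driver.quit()


def parse_club_links(html: str, link_selector: str = "a.club") -> dict[str, str]:
    """Return {club name: url} for every link matched by link_selector.

    The default selector is hypothetical; adjust it to the real markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    return {
        a.get_text(strip=True): a["href"]
        for a in soup.select(link_selector)
        if a.has_attr("href")
    }


# Made-up markup standing in for a saved club-list page.
sample_html = """
<ul>
  <li><a class="club" href="/clubs/1/example-fc">Example FC</a></li>
  <li><a class="club" href="/clubs/2/sample-united"> Sample United </a></li>
  <li><a href="/about">About</a></li>
</ul>
"""
clubs = parse_club_links(sample_html)
```

Saving the rendered HTML first and parsing the local copy keeps the slow browser step separate from the parsing logic, so selectors can be adjusted without fetching the pages again.
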
2. This step provided a player list and URLs for each of the twenty clubs competing in the EPL in a given year.
   - Created a pandas DataFrame with Name, Year, Club, Position, and Nationality and wrote it to a CSV file
   - Get Club HTMLs
     - Used Selenium with ChromeDriver because the dropdown bar would not update when given a specific URL.
     - Saved HTMLs to "data/epl/epl_clubs/year/clubs/club"
     - Example of webpage: Club
   - Parse Club HTMLs
     - Used BeautifulSoup to extract player/URL key/value pairs from the local HTML file
     - Saved this information as a dictionary in a class variable to be accessed later
     - Constructed a pandas DataFrame with the columns below and wrote it to a CSV file
       - Name, Year, Club, Position, Nationality
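
As a rough sketch of the parsing half of this step, the function below turns one saved club page into both the player/URL dictionary and the club-level DataFrame. The selectors (`a.player`, `.name`, `.position`, `.nationality`), the function name, and the sample markup are all hypothetical; the real page structure is not documented here.

```python
import pandas as pd
from bs4 import BeautifulSoup

CLUB_COLUMNS = ["Name", "Year", "Club", "Position", "Nationality"]


def parse_squad(html: str, club: str, year: int):
    """Return ({player name: url}, DataFrame) for one club page."""
    soup = BeautifulSoup(html, "html.parser")
    player_urls, rows = {}, []
    # Hypothetical markup: one <a class="player"> card per squad member.
    for card in soup.select("a.player"):
        name = card.select_one(".name").get_text(strip=True)
        player_urls[name] = card["href"]
        rows.append({
            "Name": name,
            "Year": year,
            "Club": club,
            "Position": card.select_one(".position").get_text(strip=True),
            "Nationality": card.select_one(".nationality").get_text(strip=True),
        })
    return player_urls, pd.DataFrame(rows, columns=CLUB_COLUMNS)


# Placeholder markup and names, not real data.
sample_html = """
<a class="player" href="/players/1/player-a">
  <span class="name">Player A</span>
  <span class="position">Forward</span>
  <span class="nationality">England</span>
</a>
"""
urls, squad_df = parse_squad(sample_html, club="Example FC", year=2006)
# With no path argument, to_csv returns the CSV text; pass a path to write a file.
csv_text = squad_df.to_csv(index=False)
```
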
3. This step provided statistics about an individual player for a given year.
   - Created a pandas DataFrame with 58 columns and wrote it to a CSV file
   - Get Player HTMLs
     - Used Selenium with ChromeDriver because the dropdown bar would not update when given a specific URL.
     - Saved HTMLs to "data/epl/epl_players/year/players/player"
     - Example of player webpages:
   - Parse Player HTMLs
     - Used BeautifulSoup to extract all appropriate statistics from the local HTML file
     - Put this information into a pandas DataFrame, then wrote the DataFrame to a CSV file
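
One way the statistics extraction might look is sketched below: each label/value pair on the player page becomes one column of a single-row DataFrame. The `div.stat`, `.label`, and `.value` selectors and the function name are assumptions, and the numeric cleanup (dropping thousands separators and percent signs) is an illustrative choice rather than something this README describes.

```python
import pandas as pd
from bs4 import BeautifulSoup


def parse_player_stats(html: str, name: str, year: int) -> pd.DataFrame:
    """Return a one-row DataFrame: Name, Year, then one column per statistic."""
    soup = BeautifulSoup(html, "html.parser")
    row = {"Name": name, "Year": year}
    # Hypothetical markup: <div class="stat"> holding a label and a value.
    for stat in soup.select("div.stat"):
        label = stat.select_one(".label").get_text(strip=True)
        value = stat.select_one(".value").get_text(strip=True)
        # Assumed cleanup: "1,234" -> 1234 and "45%" -> 45; unparseable -> NaN.
        cleaned = value.replace(",", "").rstrip("%")
        row[label] = pd.to_numeric(cleaned, errors="coerce")
    return pd.DataFrame([row])


# Placeholder markup and values, not real data.
sample_html = """
<div class="stat"><span class="label">Goals</span><span class="value">12</span></div>
<div class="stat"><span class="label">Passes</span><span class="value">1,234</span></div>
<div class="stat"><span class="label">Shooting Accuracy</span><span class="value">45%</span></div>
"""
stats_df = parse_player_stats(sample_html, name="Player A", year=2006)
```
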
4. For a given year, I now had two pandas DataFrames that needed to be merged.
   - Club-level DataFrame with 5 columns (Name, Year, Club, Position, Nationality)
   - Player-level DataFrame with 58 columns
   - Merged on Name, Year, Position, Nationality
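
The merge can be illustrated with a small pandas example. The rows are placeholders, the player-level frame is cut down to two statistics, and the `how="inner"` and `validate="one_to_one"` arguments are assumptions about how the join was configured.

```python
import pandas as pd

MERGE_KEYS = ["Name", "Year", "Position", "Nationality"]

# Placeholder rows, not real data.
club_df = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Year": [2006, 2006],
    "Club": ["Example FC", "Sample United"],
    "Position": ["Forward", "Goalkeeper"],
    "Nationality": ["England", "Spain"],
})
player_df = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Year": [2006, 2006],
    "Position": ["Forward", "Goalkeeper"],
    "Nationality": ["England", "Spain"],
    "Appearances": [30, 0],
    "Goals": [12, 0],
})

# validate raises MergeError if a key combination repeats on either side,
# which would otherwise silently duplicate rows.
merged = club_df.merge(player_df, on=MERGE_KEYS, how="inner", validate="one_to_one")
```

Each key column appears once in the result, so the merged width is the sum of both widths minus the four shared keys.
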
5. Iterate Steps 1-4 from 2006 to 2018, concatenating each resulting DataFrame
   - These are the years for which the player statistics fields were consistent
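
The yearly loop could be sketched as below. `build_season` is a hypothetical stand-in for Steps 1-4 that returns a tiny placeholder frame, so only the iteration and concatenation are shown.

```python
import pandas as pd


def build_season(year: int) -> pd.DataFrame:
    """Hypothetical stand-in for Steps 1-4: one merged frame per year."""
    return pd.DataFrame({"Name": ["Player A"], "Year": [year], "Appearances": [0]})


YEARS = range(2006, 2019)  # 2006 through 2018 inclusive

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
all_seasons = pd.concat([build_season(y) for y in YEARS], ignore_index=True)
```
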
- Resulting DataFrame: 7473 rows x 59 columns
  - Using only the subset where the player made an appearance that year: 4750 rows x 59 columns
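
Selecting the appearance subset might look like the following, assuming the 'Appearances' column is numeric and that "made an appearance" means a value above zero. The rows are placeholders.

```python
import pandas as pd

# Placeholder rows, not real data.
all_seasons = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Year": [2006, 2006, 2007],
    "Appearances": [30, 0, 12],
})

# Keep only player-years with at least one appearance.
appeared = all_seasons[all_seasons["Appearances"] > 0].reset_index(drop=True)
```
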
- Columns:
  - Global:
    - 'Name', 'Year', 'Club', 'Position', 'Appearances', 'Wins', 'Losses', 'Nationality'
  - Attack:
    - 'Goals', 'Headed Goals', 'Right Footed Goals', 'Left Footed Goals', 'Hit Woodwork', 'Goals per Match', 'Penalties Scored', 'Freekicks Scored', 'Shots', 'Shots on Target', 'Shooting Accuracy', 'Big Chances Missed'
  - Defence:
    - 'Tackles', 'Blocked Shots', 'Interceptions', 'Clearances', 'Headed Clearances', 'Tackle Success', 'Recoveries', 'Duels Won', 'Duels Lost', 'Successful 50/50s', 'Aerials Battles Won', 'Aerial Battles Lost', 'Clean Sheets', 'Goals Conceded', 'Own Goals', 'Errors Lead to a Goal', 'Last Man Tackles', 'Clearances Off the Line'
  - Team Play:
    - 'Assists', 'Passes', 'Passes per Game', 'Big Chances', 'Crosses', 'Cross Accuracy', 'Through Balls', 'Accurate Long Balls'
  - Discipline:
    - 'Yellows', 'Reds', 'Fouls', 'Offsides'
  - Goalkeeping:
    - 'Goalie Goals', 'Saves', 'Penalties Saved', 'Punches', 'High claims', 'Catches', 'Sweeper Clearances', 'Throw Outs', 'Goal Kicks'