
NFL-Stats-Web-Scrape

This web scraper gathers basic statistics and career statistics provided by the NFL on their official website for all active players and all 40,000+ retired players. The complete dataset is available here: https://www.kaggle.com/trevyoungquist/2020-nfl-stats-active-and-retired-players

How to run the code

After installing the requirements from requirements.txt, open your preferred terminal, navigate into the Scraper directory, and run the command "python Activate.py".

Overall, the entire scraping process of gathering individual player links, basic stats, and career stats can take most of a day to complete, depending on your computer and your internet connection speed.

Summary

There are three parts to this code:

  1. Gathering individual player links, in order to access their profile page
  2. Gathering basic stats for each player (e.g. name, college, height, weight)
  3. Gathering career stats for each player
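The three stages above can be sketched as a small pipeline. Everything here is illustrative: `scrape` and `fetch` are hypothetical names, and the real code works with HTML pages rather than ready-made dictionaries.

```python
def scrape(query_pages, fetch):
    """Illustrative sketch of the three stages.  `fetch` is an injected
    stand-in for the real page-downloading code, so the sketch stays
    testable offline.
    """
    # Stage 1: collect individual player links from each query page.
    player_links = []
    for page in query_pages:
        player_links.extend(fetch(page)["links"])

    basic_stats, career_stats = [], []
    for link in player_links:
        profile = fetch(link)
        # Stage 2: record each player's basic stats.
        basic_stats.append(profile["basic"])
        # Stage 3: record career stats only when the page has stats tables.
        if profile.get("career"):
            career_stats.append(profile["career"])
    return basic_stats, career_stats
```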

Files and Explanation

  • Activate.py
  • Players.py
  • Urls.py
  • Scraper_Functions
    1. Basic_Stats_Functions.py
    2. Career_Stats_Functions.py
    3. CSV_Handler.py
    4. Gather_Player_Urls.py

Activate.py

This is where the core functions of the code are executed: gathering and processing query URLs and player URLs, and creating the appropriate CSV files to store the data.

Classes: Players.py and Urls.py

In Players.py, three classes store player data: ActivePlayer, RetiredPlayer, and Player_CareerStats. ActivePlayer and RetiredPlayer hold and process the basic stats of individual players; each is initialized once, so a single instance is reused throughout the scraping process. Player_CareerStats is initialized every time a new player's career stats are processed, for both active and retired players. If a player's webpage contains no stats tables, that player is skipped and is not recorded in any career stats CSV file.
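A minimal sketch of that reuse pattern, with illustrative field names (the real classes track many more attributes):

```python
class RetiredPlayer:
    """Sketch: one instance is created up front and cleared between
    players, rather than allocating a new object per player."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear the fields before processing the next player.
        self.name = None
        self.college = None
        self.hall_of_fame = False


class Player_CareerStats:
    """Sketch: created fresh for every player whose page has stats tables."""

    def __init__(self, name, tables):
        self.name = name
        self.tables = tables  # e.g. {"Passing": [...], "Rushing": [...]}
```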

The Urls class holds all player query links for active players and retired players, and provides helpers for processing those and other links.
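A sketch of what such a container might look like; `base_url`, `page_count`, and `query_links` are illustrative names, and the real class targets the NFL.com player-search pages, 100 players per page:

```python
class Urls:
    """Illustrative container for the paginated player-query links."""

    def __init__(self, base_url, page_count):
        self.base_url = base_url
        self.page_count = page_count

    def query_links(self):
        # One link per results page; each page lists up to 100 players.
        return [f"{self.base_url}?page={n}" for n in range(1, self.page_count + 1)]
```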

Gather_Player_Urls.py

This file contains only two functions, one of which is essential to every other function in the code. The Extract_Individual_Player_Links function gathers all individual player links from each player query page. The links for each query page are stored in the Urls class, and each query page holds 100 individual player links.

The Get_HTML_Document function returns the parsed "soup" of any link passed to it. This function is used often throughout the code.
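Assuming the scraper is built on requests and BeautifulSoup (the mention of "soup" suggests the latter), Get_HTML_Document likely looks roughly like this; the timeout and error handling are assumptions:

```python
import requests
from bs4 import BeautifulSoup


def Get_HTML_Document(url):
    """Fetch a page and return its parsed soup, or None if the
    request fails (sketch; error handling is an assumption)."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return BeautifulSoup(response.text, "html.parser")
```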

CSV_Handler.py

All functions for creating, appending to, and writing CSV files are held in this file.
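A hedged sketch of what such helpers typically look like; Create_CSV and Append_Row are hypothetical names, not necessarily those used in the file:

```python
import csv


def Create_CSV(path, header):
    """Create a CSV file and write its header row (hypothetical helper)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(header)


def Append_Row(path, row):
    """Append one player's record to an existing CSV file."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(row)
```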

Basic_Stats_Functions.py

This file handles extracting basic profile stats for all players, with active and retired players processed in separate functions. Much of the information available for active players is not available for retired players, given how many more retired players there are than active ones. The code also records whether or not a retired player is in the Hall of Fame.

Career_Stats_Functions.py

These functions handle both active and retired players. The sloppiest code in this web scraper is most likely in the Player_Stats function, even though it works as expected. A player's stats tables carry no particular class or identifier separating, for example, a "Passing" table from a "Punting" table, and the markup varies from player to player (this makes more sense when you compare the HTML source of two or more player pages). This was the best way I found at the time to get around that obstacle.
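One way around that obstacle, sketched here purely as an illustration (the column names and signatures are invented, and this is not necessarily the approach Player_Stats takes), is to fingerprint each table by its header cells:

```python
from bs4 import BeautifulSoup

# Illustrative column-header fingerprints; the real pages may differ.
TABLE_SIGNATURES = {
    "Passing": {"Att", "Comp", "Pass Yds"},
    "Punting": {"Punts", "Punt Yds"},
}


def classify_stats_table(table):
    """Guess a table's category from its header cells, since the HTML
    carries no distinguishing class or id (sketch, not the real code)."""
    headers = {th.get_text(strip=True) for th in table.find_all("th")}
    for name, signature in TABLE_SIGNATURES.items():
        if signature <= headers:
            return name
    return None
```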

If a player does not have any stats tables in their webpage, then the player is not recorded in the career stats CSV files.