soccerdata-scraper

Scrape soccer data from Wikipedia across various European football leagues and perform interactive data visualizations on it.

Primary language: Python. License: MIT.

Basic Overview

soccerdata-scraper scrapes soccer data from Wikipedia for tier 1 European football leagues and builds interactive data visualizations from it.

The leagues currently available for scraping and visualization are listed below.

| League | Seasons | Source |
| --- | --- | --- |
| English Premier League | 1992-93 to present | https://en.wikipedia.org/wiki/Category:Premier_League_seasons |
| Spanish La Liga | 1929-30 to present | https://en.wikipedia.org/wiki/Category:La_Liga_seasons |
| Italian Serie A | 1929-30 to present | https://en.wikipedia.org/wiki/Category:Serie_A_seasons |
| German Bundesliga | 1963-64 to present | https://en.wikipedia.org/wiki/Category:Bundesliga_seasons |
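Each source above is a Wikipedia category page that links to one article per season. The sketch below shows one way such links could be pulled out of a category page's HTML. It is not the project's actual scraper: it uses the standard library's html.parser instead of Beautiful Soup to stay dependency-free, and the names ArticleLinkParser and extract_article_links, the namespace filter, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"


class ArticleLinkParser(HTMLParser):
    """Collect links to ordinary articles (hrefs under /wiki/ with no namespace)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: namespaced pages (Category:, Help:, Special:) are not seasons.
        if href.startswith("/wiki/") and ":" not in href:
            self.links.append(urljoin(BASE_URL, href))


def extract_article_links(html):
    """Return absolute URLs for every article link found in `html`."""
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical, simplified stand-in for a category page body.
sample = """
<div id="mw-pages">
  <a href="/wiki/1992%E2%80%9393_FA_Premier_League">1992-93 season</a>
  <a href="/wiki/Category:Premier_League">Parent category</a>
  <a href="/wiki/Help:Category">Help</a>
</div>
"""
season_links = extract_article_links(sample)
print(season_links)
```

In a real run the HTML would come from an HTTP request to one of the category URLs in the table, and the filter would likely need tightening to match the page's actual layout.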

Requirements

Install the dependencies listed below manually, or use requirements.txt:

pip install -r requirements.txt

The following libraries, apart from the standard ones, are required for soccerdata-scraper to work correctly. Python 3.7.x or higher and the most recent stable builds of the libraries are recommended.

bs4

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

requests

Requests is an elegant and simple HTTP library for Python, built for human beings.

numpy

NumPy is the fundamental package for array computing with Python.

pandas

Powerful data structures for data analysis, time series, and statistics

plotly

An open-source, interactive graphing library for Python

cefpython3

GUI toolkit for embedding a Chromium widget in desktop applications

PIL

Python Imaging Library. On PyPI it is installed as the Pillow package and imported as PIL.
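Before launching the application, it can help to confirm that the interpreter and the libraries above are in place. The following is a minimal sketch of such a check, not something shipped with the project. The helper names missing_modules and python_is_supported are hypothetical, and the module list simply mirrors the import names given in this section.

```python
import importlib.util
import sys

# Import names as listed in this README (PIL is provided by the Pillow package).
REQUIRED_MODULES = ["bs4", "requests", "numpy", "pandas", "plotly", "cefpython3", "PIL"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    """Check the recommended minimum of Python 3.7."""
    return tuple(version_info[:2]) >= (3, 7)


if not python_is_supported():
    print("Python 3.7 or higher is recommended.")

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

The check only confirms that each module can be located. It does not verify versions.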

Usage

After making sure all dependencies are installed correctly, execute main.py. If everything's right, a graphical interface window should pop up.

  1. Press START.

  2. Select a league from the top bar.

  3. Click on the Select Season drop-down and choose a season.

Output

A new window should open containing interactive visualizations for the selected season's data. Click on the subheadings in this window to expand them and view the visualizations inside. All generated graphs can be interacted with in this window. A complete sample interactive visualization report can be seen here.

Each generated visualization report is also stored in an HTML file, so it can be reopened and interacted with in a web browser. If only some visualizations are required, they are also stored separately in HTML files and can be retrieved individually. In addition, all scraped data is parsed into a JSON file and stored, should you need only the data and not the visualizations.

A new folder called dumps should appear in the soccerdata-scraper directory, or whatever you have named the current directory. It contains three folders: graphs, json, and reports.

Each of the three folders contains 4 subfolders, one for each league.
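To inspect this layout without a file browser, a short script can list each top-level folder in dumps together with its subfolders. This is only an illustrative sketch: summarize_dumps is a hypothetical helper, and it discovers folder names at run time instead of assuming them.

```python
from pathlib import Path


def summarize_dumps(root="dumps"):
    """Map each top-level folder in `root` to the sorted names of its subfolders."""
    root = Path(root)
    if not root.is_dir():
        return {}
    return {
        folder.name: sorted(child.name for child in folder.iterdir() if child.is_dir())
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


# Run from the directory that contains dumps; prints nothing if it does not exist yet.
for name, subfolders in summarize_dumps().items():
    print(name, subfolders)
```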

The contents of the graphs folder look something like this after selecting a league.

After selecting the respective season folder, individual visualizations can be interacted with.

The contents of the json folder look something like this after selecting a league. All the data used for the visualizations can be obtained from these files.

The reports folder contains the complete season-wise interactive visualization reports for each league, as seen through the interface. Its contents after selecting a league should look something like this.
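If only the data is needed, the stored JSON can be read back with the standard library. The sketch below makes several assumptions: that the files live under dumps/json, that they carry a .json extension, and that they are UTF-8 encoded. The helper load_json_dumps is hypothetical, and since the schema of the files is not documented here, the example only reports each file's top-level type.

```python
import json
from pathlib import Path


def load_json_dumps(root="dumps/json"):
    """Return {relative path: parsed content} for every .json file under `root`."""
    root = Path(root)
    if not root.is_dir():
        return {}
    loaded = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[path.relative_to(root).as_posix()] = json.load(handle)
    return loaded


for name, content in load_json_dumps().items():
    # Only the top-level type is shown because the file schema is not assumed.
    print(name, type(content).__name__)
```

From there, the parsed content could be handed to pandas for further analysis once its structure is known.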

Note

While this has been extensively tested, specific visualizations for some seasons might fail due to page changes or other reasons. Even then, visualizations should still work for whatever data was scraped and parsed without issues.
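One common way to get this kind of graceful degradation is to build each visualization independently and record failures instead of aborting. The following is a generic sketch of that pattern, not the project's actual code. The function render_all, the builder names, and the sample data are all hypothetical.

```python
def render_all(builders, data):
    """Run each visualization builder, collecting results and failures separately."""
    rendered, failed = {}, {}
    for name, build in builders.items():
        try:
            rendered[name] = build(data)
        except Exception as exc:
            # One failing chart should not prevent the others from rendering.
            failed[name] = repr(exc)
    return rendered, failed


# Hypothetical season data in which the "scorers" field was not scraped.
data = {"points": [3, 1, 2]}
builders = {
    "total_points": lambda d: sum(d["points"]),
    "top_scorers": lambda d: d["scorers"][:3],  # raises KeyError for this data
}

rendered, failed = render_all(builders, data)
print("rendered:", rendered)
print("failed:", sorted(failed))
```

Here the total_points result is still produced even though top_scorers fails on the missing field.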