Scrape tables from Wikipedia articles into CSVs

Primary language: Python. License: MIT.

wiki-table-scrape

Scrape all the tables from a Wikipedia article into a folder of CSV files.

You can read more about it in the blog post.

Installation

This is a Python 3.5 module that depends on the Beautiful Soup and requests packages.

  1. Clone and cd into this repo.
  2. Install Python 3.5.
  3. Install requirements from pip with pip install -r requirements.txt.
  4. If on Windows, download the .whl for the lxml parser and install it locally.
  5. Test the program by running python test_wikitablescrape.py.

Usage

Import the module and call the scrape function. Pass it the full URL of a Wikipedia article and a plain string (no special characters or file extension) for the output name. All output is written to the output_name folder, in files named output_name.csv, output_name_1.csv, and so on.

import wikitablescrape

wikitablescrape.scrape(
    url="https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    output_name="films"
)

Inspecting the output with Bash gives the following results:

$ ls films/
films.csv  films_1.csv  films_2.csv  films_3.csv
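
The naming pattern in the listing above can be sketched in a few lines. This is only an illustration of the pattern as the README describes it, not the module's own code; expected_output_paths is a hypothetical helper that is not part of wikitablescrape.

```python
from pathlib import Path


def expected_output_paths(output_name, table_count):
    """Hypothetical helper: list the CSV paths implied by the naming pattern."""
    folder = Path(output_name)
    paths = []
    for index in range(table_count):
        # The first table has no suffix; later tables get _1, _2, ...
        suffix = "" if index == 0 else "_{}".format(index)
        paths.append(folder / "{}{}.csv".format(output_name, suffix))
    return paths


print([p.name for p in expected_output_paths("films", 4)])
# ['films.csv', 'films_1.csv', 'films_2.csv', 'films_3.csv']
```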

$ cat films/films_1.csv
"Rank","Title","Worldwide gross (2014 $)","Year"
"1","Gone with the Wind","$3,440,000,000","1939"
"2","Avatar","$3,020,000,000","2009"
"3","Star Wars","$2,825,000,000","1977"
"4","Titanic","$2,516,000,000","1997"
"5","The Sound of Music","$2,366,000,000","1965"
"6","E.T. the Extra-Terrestrial","$2,310,000,000","1982"
"7","The Ten Commandments","$2,187,000,000","1956"
"8","Doctor Zhivago","$2,073,000,000","1965"
"9","Jaws","$2,027,000,000","1975"
"10","Snow White and the Seven Dwarfs","$1,819,000,000","1937"
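
Because every cell is written as a quoted string, numeric columns need converting before use. Here is a minimal sketch using only the standard library, with column names taken from the sample above. parse_dollars and load_films are hypothetical helpers, not part of the module, and the sample is embedded so the snippet runs without the repo.

```python
import csv
import io

# First rows of films_1.csv as shown above, embedded for a self-contained demo.
SAMPLE = '''"Rank","Title","Worldwide gross (2014 $)","Year"
"1","Gone with the Wind","$3,440,000,000","1939"
"2","Avatar","$3,020,000,000","2009"
'''


def parse_dollars(text):
    """Turn a string such as "$3,440,000,000" into an int."""
    return int(text.replace("$", "").replace(",", ""))


def load_films(handle):
    """Read rows from an open CSV file (or any text stream) into dicts."""
    films = []
    for row in csv.DictReader(handle):
        films.append({
            "rank": int(row["Rank"]),
            "title": row["Title"],
            "gross": parse_dollars(row["Worldwide gross (2014 $)"]),
            "year": int(row["Year"]),
        })
    return films


films = load_films(io.StringIO(SAMPLE))
print(films[0]["title"], films[0]["gross"])
# Gone with the Wind 3440000000
```

To read the real file, pass an open handle instead, for example open("films/films_1.csv", newline="", encoding="utf-8"). The UTF-8 encoding is an assumption here, so adjust it if the files were written differently.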

Disclaimers

The script won't give you 100% clean data for every page on Wikipedia, but it will get you most of the way there. You can see the output from the pages for mountain height, volcano height, NBA scores, and the highest-grossing films in the output folder of this repo.
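
If the leftover noise in a given page turns out to be bracketed footnote markers or stray whitespace, a small post-processing pass may be enough. That is an assumption about what "not clean" looks like, not something the module documents. The sketch below uses a hypothetical clean_cell helper and made-up input strings.

```python
import re

# Matches bracketed markers such as "[1]", "[note 2]" or "[citation needed]".
FOOTNOTE = re.compile(r"\[[^\[\]]*\]")


def clean_cell(text):
    """Drop bracketed footnote markers and collapse runs of whitespace."""
    without_notes = FOOTNOTE.sub("", text)
    # str.split() with no arguments also splits on non-breaking spaces.
    return " ".join(without_notes.split())


print(clean_cell("Mount  Everest[1] \n"))
# Mount Everest
```

This also removes any legitimate bracketed text, so check a sample of the output before applying it to a whole folder.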

I only plan to add features to this module as I need them, but if you would like to contribute, please open an issue or pull request.

If you'd like to read more about this module, please check out my blog post.