Tutorial on web scraping with Python and PDFs

Scraping Workshop

This is a workshop that teaches how to use Python to create a dataset by scraping a website.

This entails parsing HTML, downloading PDFs, and extracting data from PDFs.
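As a rough illustration of the HTML-parsing step, here is a minimal sketch that collects PDF links from a page using only the standard library. It is not the notebook's actual code: the names PdfLinkParser and find_pdf_links, the sample HTML, and the example.com URLs are all made up for this example, and the notebook may well use a different parsing library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a> links whose href ends in .pdf (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Match the extension case-insensitively so ".PDF" links are kept too.
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the URL of the page they came from.
            self.pdf_urls.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return the PDF URLs found in an HTML string, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


# Made-up sample page; the real workshop site will look different.
sample_html = """
<ul>
  <li><a href="reports/2021.pdf">2021 report</a></li>
  <li><a href="/about.html">About</a></li>
  <li><a href="https://example.com/files/2022.PDF">2022 report</a></li>
</ul>
"""

print(find_pdf_links(sample_html, "https://example.com/index.html"))
```

Each URL returned this way can then be downloaded and passed to whatever PDF extraction step the notebook uses.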

Installation

Install JupyterLab if necessary (you can use a virtual environment). I set this up with Python 3.10. Then install the workshop's dependencies:

pip install -r requirements.txt

You can then start the JupyterLab server by running jupyter-lab.

Running the workshop

Just open the notebook in JupyterLab; it explains everything.

Backups

It's possible the source website will change or disappear entirely, so a copy of it is archived in the bak/web directory. All the PDFs that should be downloaded are in bak/raw. A sample "final product" CSV is also included in the bak/data directory.
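If the live site is unavailable, one way to keep working is to fall back to the backed-up PDFs. The sketch below is only an illustration, not part of the notebook: download and read_pdf_bytes are hypothetical helpers, and it assumes each file in bak/raw is named after the last path segment of its original URL, which may not match how the backups are actually organized.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def download(url, timeout=10):
    """Fetch a URL and return the response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def read_pdf_bytes(url, backup_dir="bak/raw", fetch=download):
    """Return the PDF at url, or the local backup copy if the download fails.

    Assumption: the backup file has the same name as the last path
    segment of the URL (e.g. .../report.pdf -> bak/raw/report.pdf).
    """
    try:
        return fetch(url)
    except OSError:
        # urllib's URLError and HTTPError, as well as socket timeouts,
        # are all subclasses of OSError.
        backup_path = Path(backup_dir) / Path(urlparse(url).path).name
        return backup_path.read_bytes()
```

The fetch parameter exists so the fallback can be exercised without network access, for example by passing a function that always raises OSError.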