/ire24-scraping

Tutorials for IRE 2024 about using Playwright to scrape state-level license and violations data


Scrape state-level license and violations data with browser automation tools

a.k.a. "Using Playwright to scrape pages that require you to fill out forms"

IRE 2024, Los Angeles

Jonathan Soma, Knight Chair in Data Journalism at Columbia Journalism School

Contact: jonathan.soma@gmail.com / @dangerscarf / jonathansoma.com

What is this?

State-level agencies always have absolutely awful websites for accessing information. It's horrible, it's awful, it's tragic, and it's a total pain to try to kidnap the data through scraping.

In these examples we're going to try to scrape a wide, wide, wide selection of different sites, using common approaches you can hopefully adapt to your own "this site has data I want!" situations. For most of these you don't need to know anything except how to hold down shift and press enter.

Each example uses Playwright for scraping, a "browser automation tool" that controls a browser for you. This allows you to click, use dropdowns, all sorts of fun stuff. It's also very fulfilling because once you run the code it's like a secret computer ghost is controlling your browser (and doing all of your work for you).

It's also also great because unlike traditional scraping you can actually see what's going on with the page, which makes things a little more... accessible? friendly?
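To make the "click, use dropdowns" part concrete, here is a minimal sketch of what a form-filling scrape can look like with Playwright's async API, which is the flavor that works inside Jupyter. The URL, the selectors (`#last-name`, `#license-type`, `table#results`) and the function names are all made up for illustration and do not come from any of the examples here; you would swap in whatever the real page uses.

```python
async def search_licenses(page, url, last_name, license_type):
    """Fill out a (hypothetical) license search form and return the results HTML."""
    await page.goto(url)                                        # open the search page
    await page.fill("#last-name", last_name)                    # type into a text box
    await page.select_option("#license-type", license_type)     # pick from a dropdown
    await page.click("text=Search")                             # click the submit button
    await page.wait_for_selector("table#results")               # wait for results to load
    return await page.content()                                 # grab the rendered HTML


async def run_example():
    """Launch a visible browser and run the search against a placeholder URL."""
    # Imported here so the function above can be reused without Playwright installed
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        # headless=False opens a real window so you can watch the "ghost" work
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        html = await search_licenses(
            page,
            "https://example.com/license-search",  # placeholder, not a real agency site
            "Smith",
            "contractor",
        )
        await browser.close()
        return html
```

In a notebook cell you would run this with `html = await run_example()`. From a plain Python script you would use `asyncio.run(run_example())` instead.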

Note for Windows: Only one particular version of ipykernel works with Playwright on Windows, so you need to run pip install ipykernel==6.28.0 before running Jupyter. The explanation is long and complicated.

The examples

More links

If you'd like a general-purpose introduction to Playwright try this one, and if you want to know how to break CAPTCHAs, here you go!

The Playwright documentation is also pretty good.

And completely unrelated but very popular is my Practical AI for Investigative Journalism.