Scrape state-level license and violations data with browser automation tools

a.k.a. "Using Playwright to scrape pages that require you to fill out forms"

IRE 2024, Los Angeles

Jonathan Soma, Knight Chair in Data Journalism at Columbia Journalism School

Contact: jonathan.soma@gmail.com / @dangerscarf / jonathansoma.com

What is this?

State-level agencies always have absolutely awful websites for accessing information. It's horrible, it's awful, it's tragic, and it's a total pain to try to kidnap the data through scraping.

In these examples we're going to try to scrape a wide, wide, wide selection of different sites, using common approaches you can hopefully adapt to your own "this site has data I want!" situations. For most of these you don't need to know anything except how to hold down shift and press enter (that's how you run a cell in a Jupyter notebook).

Each example uses Playwright for scraping, a "browser automation tool" that controls a browser for you. This allows you to click, use dropdowns, all sorts of fun stuff. It's also very fulfilling because once you run the code it's like a secret computer ghost is controlling your browser (and doing all of your work for you).

It's also also great because unlike traditional scraping you can actually see what's going on with the page, which makes things a little more... accessible? friendly?
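To give a rough idea of what "controlling a browser" looks like in code, here is a minimal sketch, not taken from any of the notebooks. The URL, the selectors, and the option label are all made up for illustration, and the helper name `search_with_dropdown` is hypothetical. It uses Playwright's async API, which is the flavor that works inside Jupyter.

```python
import asyncio


async def search_with_dropdown(page, url, dropdown_selector, option_label, button_selector):
    """Open a page, pick a dropdown option, click a button, and return the page's HTML."""
    await page.goto(url)
    # select_option(label=...) picks the option by the text a human would see
    await page.locator(dropdown_selector).select_option(label=option_label)
    await page.locator(button_selector).click()
    return await page.content()


async def main():
    # Imported here so the helper above can be read without Playwright installed
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        # headless=False opens a real, visible browser window so you can watch it work
        browser = await playwright.chromium.launch(headless=False)
        page = await browser.new_page()
        html = await search_with_dropdown(
            page,
            url="https://example.com/license-search",  # hypothetical URL
            dropdown_selector="#license-type",         # hypothetical selector
            option_label="Tow Truck",                  # hypothetical option text
            button_selector="#search-button",          # hypothetical selector
        )
        print(f"Got {len(html)} characters of HTML")
        await browser.close()
```

In a notebook cell you would run this with `await main()`; in a plain Python script, `asyncio.run(main())` does the same job. Real sites usually need their own selectors, which you find by inspecting the page.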

Note for Windows: For some reason only one particular version of ipykernel (the package that runs Python inside Jupyter) works with these notebooks on Windows, so you need to run pip install ipykernel==6.28.0 before starting Jupyter. The explanation is long and complicated.

The examples

Texas Tow Trucks Licenses: dropdowns
Iowa Appraisal Management Companies: dropdowns, 'next page' buttons
New Jersey Perfusionists: dropdowns, numbered pages
Massachusetts Optometrists: dropdowns, downloading files
North Dakota Oil Wells: dropdowns, using every dropdown option
Maryland Locksmiths: text boxes, inspecting pages, lists of inputs (zip codes), back button
California Midwives: changing browsers, dropdowns, infinite scroll, combining with BeautifulSoup
Texas Tow Trucks Details: dropdowns, text fields, lists of inputs (licenses from a CSV), saving entire HTML page
Chicago Buildings: faking typing, lists of inputs (addresses), CSS selectors, clicking links that open new pages, updating dataframes
Texas Medical Board Actions: text inputs, inspecting the page, manipulating dates, lists of inputs (dates)
Texas Medical Board Actions Details: text fields, clicking links, downloading files, changing browsers, lists of inputs (license numbers)
Ohio, a failure: dropdowns, lists of inputs, using every dropdown option, me giving up (ell oh ell?)
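Two patterns show up again and again in the list above: working through a list of inputs, and clicking a 'next page' button until it goes away. The sketch below combines them under some loud assumptions: every selector is a placeholder you pass in, the function names are hypothetical, and waiting for "networkidle" is only one of several ways to wait for results (a given site may need something different). It is not the code from any particular notebook.

```python
import asyncio


async def scrape_all_pages(page, row_selector, next_selector, max_pages=50):
    """Collect row text from each results page, clicking 'next' until the button is gone."""
    rows = []
    for _ in range(max_pages):  # safety cap so a broken 'next' button can't loop forever
        rows.extend(await page.locator(row_selector).all_inner_texts())
        next_button = page.locator(next_selector)
        if await next_button.count() == 0:
            break  # no 'next page' button means this was the last page
        await next_button.first.click()
        await page.wait_for_load_state("networkidle")
    return rows


async def scrape_each_input(page, url, inputs, box_selector, submit_selector,
                            row_selector, next_selector):
    """Run the same search once per input (zip codes, license numbers, ...) and gather rows."""
    results = {}
    for value in inputs:
        await page.goto(url)  # start from a fresh search form each time
        await page.locator(box_selector).fill(value)
        await page.locator(submit_selector).click()
        await page.wait_for_load_state("networkidle")
        results[value] = await scrape_all_pages(page, row_selector, next_selector)
    return results
```

Here `page` is a Playwright page you have already opened, and `inputs` could be a list of zip codes or a column pulled out of a CSV. The result is a dictionary mapping each input to the rows found for it, which is easy to turn into a dataframe afterwards.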

