a.k.a. "Using Playwright to scrape pages that require you to fill out forms"
Jonathan Soma, Knight Chair in Data Journalism at Columbia Journalism School
Contact: jonathan.soma@gmail.com / @dangerscarf / jonathansoma.com
State-level agencies always have absolutely awful websites for accessing information. It's horrible, it's awful, it's tragic, and it's a total pain to try to kidnap the data through scraping.
In these examples we're going to try to scrape a wide, wide, wide selection of different sites, using common approaches you can hopefully adapt to your own "this site has data I want!" situations. For most of these you don't need to know anything except how to hold down shift and press enter.
Each example uses Playwright for scraping, a "browser automation tool" that controls a browser for you. This allows you to click, use dropdowns, all sorts of fun stuff. It's also very fulfilling because once you run the code it's like a secret computer ghost is controlling your browser (and doing all of your work for you).
It's also also great because unlike traditional scraping you can actually see what's going on with the page, which makes things a little more... accessible? friendly?
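To make "click, use dropdowns" a little more concrete, here is a minimal sketch of what that kind of Playwright code tends to look like. It is not taken from any of the examples below. The URL, the selectors (`#license-type`, `#search-button`) and the option label are all made up for illustration, and the helper name `pick_option_and_submit` is hypothetical. It uses Playwright's sync API because that reads well as a plain script. Inside a Jupyter notebook you would most likely need the async API instead.

```python
def pick_option_and_submit(page, url, dropdown_selector, option_label, submit_selector):
    """Open a page, choose a dropdown option, click a button, return the HTML.

    `page` is a Playwright Page object. Every selector passed in here is an
    assumption about the target site, so inspect the real page to find yours.
    """
    page.goto(url)
    # Pick the dropdown entry by the text a human would see in the list
    page.select_option(dropdown_selector, label=option_label)
    page.click(submit_selector)
    # Give the results a chance to load before grabbing the page source
    page.wait_for_load_state("networkidle")
    return page.content()


def run_demo():
    # Imported here so the helper above can be read (and tested) on its own
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=False keeps the browser window visible so you can watch it work
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        html = pick_option_and_submit(
            page,
            url="https://example.com/license-search",  # hypothetical URL
            dropdown_selector="#license-type",         # hypothetical selector
            option_label="Tow Truck",                  # hypothetical option text
            submit_selector="#search-button",          # hypothetical selector
        )
        print(len(html), "characters of HTML")
        browser.close()


# run_demo()  # uncomment to open a real browser (needs `playwright install` first)
```

Keeping the page-driving steps in their own function means you can reuse them across a list of inputs (zip codes, license numbers, dates) by calling the function in a loop with a different value each time.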
Note for Windows: For some reason only one particular version of one particular thing works on Windows, so you need to run
pip install ipykernel==6.28.0
before running Jupyter. The explanation is long and complicated.
- Texas Tow Trucks Licenses: dropdowns
- Iowa Appraisal Management Companies: dropdowns, 'next page' buttons
- New Jersey Perfusionists: dropdowns, numbered pages
- Massachusetts Optometrists: dropdowns, downloading files
- North Dakota Oil Wells: dropdowns, using every dropdown option
- Maryland Locksmiths: text boxes, inspecting pages, lists of inputs (zip codes), back button
- California Midwives: changing browsers, dropdowns, infinite scroll, combining with BeautifulSoup
- Texas Tow Trucks Details: dropdowns, text fields, lists of inputs (licenses from a CSV), saving entire HTML page
- Chicago Buildings: faking typing, lists of inputs (addresses), CSS selectors, clicking links that open new pages, updating dataframes
- Texas Medical Board Actions: text inputs, inspecting the page, manipulating dates, lists of inputs (dates)
- Texas Medical Board Actions Details: text fields, clicking links, downloading files, changing browsers, lists of inputs (license numbers)
- Ohio, a failure: dropdowns, lists of inputs, using every dropdown option, me giving up (ell oh ell?)
If you'd like a general-purpose introduction to Playwright try this one, and if you want to know how to break CAPTCHAs, here you go!
The Playwright documentation is also pretty good.
And completely unrelated but very popular is my Practical AI for Investigative Journalism.