Scrape lobbyist employer expenditures
The current lobbyist scrape captures employer name (ClientName), but there is some additional metadata we could capture: https://docs.google.com/spreadsheets/d/1-c2Ony5hGjpchOfwPKhYyoJ9FPji0dfkgt0mTcBS9mo/edit?usp=sharing
Clarify with Marjorie which employer metadata lobbyist scrapes should include, then capture it.
@antidipyramid I'll email Marjorie to see what, if any, additional data about the employers she wants in the scrapes.
🚨 Glad I asked! We need to scrape the lobbyist employer expenditures from https://login.cfis.sos.state.nm.us/#/lobbyistexpendituresearch/31.
@hancush There is a good amount of data processing going on in `lobbyists.mk`. Do we want to do some kind of processing on the employer expenditures?
Great question, @antidipyramid. Some context on lobbyist (and lobbyist employer) scraping: The search interface does not include one very important piece of information: the beneficiary of the expenditure / contribution. So, the original lobbyist scrape downloads all of a lobbyist's filings, then parses information out of those PDFs.
It looks like lobbyist employers file the same information in the same format, e.g., https://login.cfis.sos.state.nm.us//ReportsOutput//LAR/4a27c051-7b49-456a-9936-98d595384a08.pdf
I wonder if we could simply plug them into the existing pipeline (perhaps with some modifications, since rather than a lobbyist associated with a client [employer], there will only be clients [employers])?
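For reference, here's a minimal sketch of what fetching and reading one of those employer PDFs might look like, assuming `requests` and `pdfplumber` (the existing scrape may well use different libraries); the URL is the example LAR linked above:

```python
import io

import pdfplumber  # assumption: any PDF text-extraction library would work here
import requests

# Example LAR filed by a lobbyist employer (linked in the comment above).
PDF_URL = (
    "https://login.cfis.sos.state.nm.us//ReportsOutput//LAR/"
    "4a27c051-7b49-456a-9936-98d595384a08.pdf"
)

response = requests.get(PDF_URL)
response.raise_for_status()

with pdfplumber.open(io.BytesIO(response.content)) as pdf:
    for page in pdf.pages:
        # If employer LARs share the lobbyist filing layout, the existing
        # parsing logic should apply here with little or no change.
        print(page.extract_text())
```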
Looks like there's an https://login.cfis.sos.state.nm.us/api//ExploreClients/Disclosures endpoint that gets filings for lobbyist employers (while it's https://login.cfis.sos.state.nm.us/api//ExploreClients/Fillings for lobbyists) – see the network request when you click on "Filings" here: https://login.cfis.sos.state.nm.us/#/exploreClientDetailPublic/mDJ2oXreU_grMhUIIWBeHHwquY7yN_7SNrmbDh6rMxI1/10/2024
If you can modify the script that retrieves filings so it works for both lobbyists and lobbyist employers, I think you can use the rest of the pipeline (PDF parsing) as is, or close to it! What do you think?
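To make the idea concrete, here's a rough sketch of how the filing-retrieval step could be parameterized by entity type. The request method and payload shape are assumptions – copy the real ones from the network request in the browser:

```python
import requests

BASE_URL = "https://login.cfis.sos.state.nm.us/api/"

# The endpoint differs by entity type; everything downstream (PDF download
# and parsing) should be able to stay the same.
FILING_ENDPOINTS = {
    "lobbyist": "ExploreClients/Fillings",    # the API's own spelling
    "employer": "ExploreClients/Disclosures",
}


def get_filings(entity_type, payload):
    """Fetch the filing listing for a lobbyist or a lobbyist employer.

    `payload` is a placeholder for whatever request body the site sends when
    you click "Filings" – copy it from the browser's network request. A POST
    with a JSON body is assumed here.
    """
    url = BASE_URL + FILING_ENDPOINTS[entity_type]
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()
```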