Are there any web scrapers that need to be written?
Are there any web scrapers that need to be written for this repo?
@sisiwei and I are teaching a web scraping class at PyCon, and we are looking for possible challenges for folks to dig into.
Our current scope is the 70+ federal IGs under the Council of Inspectors General that publish reports online in a scrape-able place. We also have the House of Representatives IG in there.
There are a few places we could happily expand to cover more areas of oversight, I think.
- The Government Accountability Office, a legislative branch agency that performs oversight on the executive branch. They are also asked to review IGs themselves now and then. They publish reports, and have an undocumented API which I once used to write a Ruby scraper that sits inside the belly of a larger app that serves a search engine. Re-implementing this in Python, and in a standalone project like this, would expand the reach of their work.
- The House Oversight Committee, amidst all the politics, produces a lot of meaningful oversight reports. I don't know how to think of the minority reports, but they exist too.
- The Senate Committee on Homeland Security and Governmental Affairs is the Senate's oversight committee. They publish reports too, though that link is filtering it down to "Reports" in the 113th Congress (none in 114th yet). Also, this page links to a PDF of investigation links by the investigations subcommittee.
- Also, GPO keeps a massive historical collection of committee reports that can be out of date by a year, but is the only source of historical work.
- The US Navy has an inspector general, with a two-page reading room, I think with only reports that came out of FOIA.
- The US Marines inspector general has a messy website that might have reports buried either inside, or on the sidebar.
- I can't find any reports for the inspectors general for the US Army or the US Air Force, but maybe you'll have better luck than me.
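For the HTML-only sources above (the reading rooms and committee pages), a class exercise could start from something like the rough sketch below. It only shows the link-extraction step, using the Python standard library. The class name, function name, sample markup, and `example.gov` URLs are all made up for illustration; none of them reflect the real structure of these sites or the conventions of the existing scrapers in this repo.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReportLinkParser(HTMLParser):
    """Collect a url/title pair for every <a> whose href ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.reports = []
        self._href = None  # absolute URL of the PDF link currently open, if any
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                # Resolve relative links against the page they were found on.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            title = " ".join("".join(self._text).split())
            self.reports.append({"url": self._href, "title": title})
            self._href = None


def extract_reports(html, base_url):
    """Return a list of {"url", "title"} dicts for PDF links in `html`."""
    parser = ReportLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.reports


# Hypothetical markup standing in for a reading-room page.
SAMPLE_HTML = """
<ul>
  <li><a href="/reports/audit-2015-01.pdf">Audit of  Example Program</a></li>
  <li><a href="/about.html">About the office</a></li>
  <li><a href="https://example.gov/files/REVIEW.PDF">Review of <b>Contracts</b></a></li>
</ul>
"""

reports = extract_reports(SAMPLE_HTML, "https://example.gov/reading-room/")
for report in reports:
    print(report["url"], "|", report["title"])
```

A real scraper would still need to fetch the pages, handle pagination, and pull out publication dates and report IDs, which is where most of the per-site work tends to be.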
Additionally, you could look over the IGs for whom we don't have any reliable report locations, and either verify that this is the case, or identify some reputable third party sources where we might find some. Those IGs are:
- Architect of the Capitol
- Capitol Police
- Central Intelligence Agency
- Defense Intelligence Agency
- National Geospatial-Intelligence Agency
- National Reconnaissance Office
- National Security Agency
- Intelligence Community (ODNI)
- Special Inspector General for Iraq Reconstruction
We'd also love it if people could identify any room for improvement in existing scrapers, or felt moved to tackle any of the open issues.
Thanks for inquiring, and for being interested in the project for your class! This was a useful thing to write down.
@konklone this is helpful! Thank you!