GAO's own reports
Not the GAO IG, but the GAO itself, which publishes an amazing number of excellent reports.
There are three interesting datasets, two of which have known existing scrapers:
- Reports, for which I (at Sunlight) wrote a Ruby scraper
- Bid protest decisions, for which @vzvenyach wrote a Python scraper
- Restricted reports, which is a new dataset and worth including as unreleased reports
For both scrapers, in their current state, I'd recommend porting them over here rather than wrapping them. Perhaps we can convince @vzvenyach to move his efforts here too!
What's the third dataset?
Whoops! I updated the issue with it. It's the restricted reports.
I'm working on a scraper that will do GAO reports and restricted reports.
There is some stuff dealing with citations in the Ruby parser. I'm assuming that can be omitted.
GAO usually provides "accessible text" .txt versions, which the Ruby parser uses to avoid running pdftotext. I will include the .txt URL in the JSON, but I don't think inspectors-general provides a way to manually supply the text that should be sent to Elasticsearch, so it can just process the PDFs as normal.
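A minimal sketch of what carrying the .txt URL alongside the PDF URL could look like. The function name `build_report`, the field names (`report_id`, `title`, `url`, `text_url`), and the example.com URLs are all assumptions made for illustration, not the actual inspectors-general schema or real GAO links.

```python
import json


def build_report(report_id, title, pdf_url, txt_url=None):
    """Assemble one report record. All field names here are hypothetical."""
    report = {
        "report_id": report_id,
        "title": title,
        # The PDF stays the primary document, so text extraction runs as normal.
        "url": pdf_url,
    }
    if txt_url:
        # Kept for reference only; nothing in this sketch reads it back.
        report["text_url"] = txt_url
    return report


# Placeholder values, not a real GAO report.
example = build_report(
    "EXAMPLE-1",
    "Placeholder title",
    "https://example.com/report.pdf",
    txt_url="https://example.com/report.txt",
)
print(json.dumps(example, indent=2))
```

Leaving `text_url` out entirely when GAO provides no accessible-text version keeps the record free of empty fields, so a later consumer can test for the key's presence.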