ClimateData

Crawling and preserving climate data

Dockets:

TODO:

  • Check each possible doc type and ensure it's saved properly
  • Find or create a database to store everything, and automate saving crawls directly to it (Archivers app?)
  • Hash documents for data integrity (see the hashing sketch after this list)
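
A minimal sketch of the hashing step, assuming documents are already saved to disk; the manifest layout and function names here are hypothetical, not part of the repo:

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a document in chunks so large PDFs never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def append_to_manifest(manifest: Path, doc: Path) -> None:
    # Hypothetical manifest layout: "<sha256>  <filename>", one line per document,
    # so a later pass can re-hash files and flag anything that changed or corrupted.
    with manifest.open("a", encoding="utf-8") as out:
        out.write(f"{sha256_of_file(doc)}  {doc.name}\n")
```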

In Progress:

  • Use the Box API to auto-upload documents (Zach; see the sketch after this list)
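
A rough sketch of what the upload could look like with the official `boxsdk` package; the environment-variable names and the one-folder-per-docket layout are assumptions, and a long-running crawl would want a refresh-token or JWT flow rather than a short-lived access token:

```python
import os
from pathlib import Path

from boxsdk import Client, OAuth2  # pip install boxsdk

def upload_docket_folder(local_dir: Path, box_folder_id: str) -> None:
    """Upload every downloaded document in a docket folder to Box."""
    # Credential sources are hypothetical placeholders; real values come from
    # an app registered in the Box developer console.
    oauth = OAuth2(
        client_id=os.environ["BOX_CLIENT_ID"],
        client_secret=os.environ["BOX_CLIENT_SECRET"],
        access_token=os.environ["BOX_ACCESS_TOKEN"],
    )
    client = Client(oauth)
    folder = client.folder(box_folder_id)
    for doc in sorted(local_dir.iterdir()):
        if doc.is_file():
            folder.upload(str(doc))  # Box names the file after the local filename
```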

Review/Testing:

Finished:

  • For each docket, start a crawl for Primary, Supporting, and Comments
  • Get all links to document pages
  • Automate finding how many documents are in each docket (see the enumeration sketch below)
  • Get links for each document and attachment on doc pages
  • Download and save each document (see the archiving sketch below)
  • Find and download all pictures embedded on the page
  • Get document metadata (id, tracking number, date posted, RIN, etc.)
  • Download information in the "show more details" section
  • Log all errors to a file so they can be checked later
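
The finished crawl steps roughly decompose into two pieces. First, enumerating every document page in a docket. A sketch assuming regulations.gov-style paginated listing pages fetched with `requests` and parsed with BeautifulSoup; the URL pattern, page parameter, and CSS selector are all hypothetical:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical docket listing URL; the real site's pattern will differ.
DOCKET_URL = "https://www.regulations.gov/docket/{docket_id}"

def document_page_urls(docket_id: str) -> list[str]:
    """Walk a docket's paginated results and collect every document-page link."""
    urls: list[str] = []
    page_no = 1
    while True:
        listing = requests.get(
            DOCKET_URL.format(docket_id=docket_id),
            params={"page": page_no},
            timeout=30,
        )
        listing.raise_for_status()
        soup = BeautifulSoup(listing.text, "html.parser")
        links = soup.select("a.document-link[href]")  # hypothetical selector
        if not links:
            break  # past the last page of results
        urls.extend(urljoin(listing.url, a["href"]) for a in links)
        page_no += 1
    return urls  # len(urls) is the document count for the docket
```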
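
Second, archiving a single document page: its metadata, attachments, and embedded images, with failures logged to a file for later re-crawls. Again only a sketch; the `data-field` attribute and attachment selector are stand-ins for the real page structure:

```python
import json
import logging
from pathlib import Path
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Errors are appended to a log file so failed downloads can be retried later.
logging.basicConfig(filename="crawl_errors.log", level=logging.ERROR)

def archive_document(doc_url: str, out_dir: Path) -> None:
    """Save one document page's metadata, attachments, and embedded images."""
    page = requests.get(doc_url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    out_dir.mkdir(parents=True, exist_ok=True)

    # Metadata fields mirror the list above (id, tracking number, date posted, RIN).
    metadata = {}
    for field in ("id", "tracking-number", "date-posted", "rin"):
        tag = soup.find(attrs={"data-field": field})  # hypothetical markup
        if tag is not None:
            metadata[field] = tag.get_text(strip=True)
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))

    # Attachment links plus every image embedded on the page.
    targets = [a["href"] for a in soup.select("a.attachment[href]")]  # hypothetical selector
    targets += [img["src"] for img in soup.find_all("img", src=True)]
    for href in targets:
        url = urljoin(doc_url, href)
        name = Path(urlparse(url).path).name or "index.html"
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            (out_dir / name).write_bytes(resp.content)
        except requests.RequestException:
            # Log and continue so one bad link doesn't abort the whole docket.
            logging.exception("failed to download %s", url)
```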