ClimateData

Crawling and preserving climate data

Dockets:

TODO:

  • Check each possible doc type and ensure it's saved properly
  • Find or create a database to store everything, and automate saving crawls directly to it (Archivers app?)
  • Hash documents for data integrity (see the hashing sketch after this list)
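
A minimal sketch of the hashing step, assuming documents are already saved to disk; the manifest layout and function names here are hypothetical, not part of the repo:

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a document in chunks so large PDFs never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def append_to_manifest(manifest: Path, doc: Path) -> None:
    # Hypothetical manifest layout: "<sha256>  <filename>", one line per document,
    # so a later pass can re-hash files and flag anything that changed or corrupted.
    with manifest.open("a", encoding="utf-8") as out:
        out.write(f"{sha256_of_file(doc)}  {doc.name}\n")
```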

In Progress:

  • Use the Box API to auto-upload documents (Zach; see the sketch after this list)
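
A rough sketch of what the upload could look like with the official `boxsdk` package; the environment-variable names and the one-folder-per-docket layout are assumptions, and a long-running crawl would want a refresh-token or JWT flow rather than a short-lived access token:

```python
import os
from pathlib import Path

from boxsdk import Client, OAuth2  # pip install boxsdk

def upload_docket_folder(local_dir: Path, box_folder_id: str) -> None:
    """Upload every downloaded document in a docket folder to Box."""
    # Credential sources are hypothetical placeholders; real values come from
    # an app registered in the Box developer console.
    oauth = OAuth2(
        client_id=os.environ["BOX_CLIENT_ID"],
        client_secret=os.environ["BOX_CLIENT_SECRET"],
        access_token=os.environ["BOX_ACCESS_TOKEN"],
    )
    client = Client(oauth)
    folder = client.folder(box_folder_id)
    for doc in sorted(local_dir.iterdir()):
        if doc.is_file():
            folder.upload(str(doc))  # Box names the file after the local filename
```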

Review/Testing:

Finished:

  • For each docket, start a crawl for Primary, Supporting, and Comments
  • Get all links to document pages
  • Automate finding how many documents are in each docket (see the enumeration sketch below)
  • Get links for each document and attachment on doc pages
  • Download and save each document (see the archiving sketch below)
  • Find and download all pictures embedded on the page
  • Get document metadata (id, tracking number, date posted, RIN, etc.)
  • Download information in the "show more details" section
  • Log all errors to a file so they can be checked later
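
The finished crawl steps roughly decompose into two pieces. First, enumerating every document page in a docket. A sketch assuming regulations.gov-style paginated listing pages fetched with `requests` and parsed with BeautifulSoup; the URL pattern, page parameter, and CSS selector are all hypothetical:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical docket listing URL; the real site's pattern will differ.
DOCKET_URL = "https://www.regulations.gov/docket/{docket_id}"

def document_page_urls(docket_id: str) -> list[str]:
    """Walk a docket's paginated results and collect every document-page link."""
    urls: list[str] = []
    page_no = 1
    while True:
        listing = requests.get(
            DOCKET_URL.format(docket_id=docket_id),
            params={"page": page_no},
            timeout=30,
        )
        listing.raise_for_status()
        soup = BeautifulSoup(listing.text, "html.parser")
        links = soup.select("a.document-link[href]")  # hypothetical selector
        if not links:
            break  # past the last page of results
        urls.extend(urljoin(listing.url, a["href"]) for a in links)
        page_no += 1
    return urls  # len(urls) is the document count for the docket
```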
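
Second, archiving a single document page: its metadata, attachments, and embedded images, with failures logged to a file for later re-crawls. Again only a sketch; the `data-field` attribute and attachment selector are stand-ins for the real page structure:

```python
import json
import logging
from pathlib import Path
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Errors are appended to a log file so failed downloads can be retried later.
logging.basicConfig(filename="crawl_errors.log", level=logging.ERROR)

def archive_document(doc_url: str, out_dir: Path) -> None:
    """Save one document page's metadata, attachments, and embedded images."""
    page = requests.get(doc_url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    out_dir.mkdir(parents=True, exist_ok=True)

    # Metadata fields mirror the list above (id, tracking number, date posted, RIN).
    metadata = {}
    for field in ("id", "tracking-number", "date-posted", "rin"):
        tag = soup.find(attrs={"data-field": field})  # hypothetical markup
        if tag is not None:
            metadata[field] = tag.get_text(strip=True)
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))

    # Attachment links plus every image embedded on the page.
    targets = [a["href"] for a in soup.select("a.attachment[href]")]  # hypothetical selector
    targets += [img["src"] for img in soup.find_all("img", src=True)]
    for href in targets:
        url = urljoin(doc_url, href)
        name = Path(urlparse(url).path).name or "index.html"
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            (out_dir / name).write_bytes(resp.content)
        except requests.RequestException:
            # Log and continue so one bad link doesn't abort the whole docket.
            logging.exception("failed to download %s", url)
```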