18F/site-scanning

create a new target URL list

Closed this issue · 4 comments

We need to reconstitute a process for building the target URL list.

Update: Here's the latest iteration of how we see this issue's work

  • Assemble updated data sources
  • MVP a workflow for combining and processing the data
  • Review the MVP's output (the first new pass at the target URL list). Criteria: it contains the necessary fields, contains only federal websites, removes duplicates, and removes masses of staging or non-website URLs
  • Based upon this review, decide what changes to the data assembly/processing workflow are required
  • Build the solution (in this case a blacklist of URLs) - Gray
  • Implement and document required changes
  • Upload it to become the target URL list used by the production system.
  • Check back later to see how the system did with the new list. Criteria: whether the full target URL list was scanned (as opposed to the system hanging before reaching the end of the list), and whether the proportion of missing or erroneous fields has risen dramatically.
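The combine/dedupe/filter steps above could be sketched roughly as follows. This is a minimal sketch, not the project's actual workflow: the `url` column name, CSV input format, and the `.gov`/`.mil` and staging-string heuristics are all assumptions.

```python
import csv

FEDERAL_SUFFIXES = (".gov", ".mil")
STAGING_MARKERS = ("staging", "dev.", "test.", "demo.")

def looks_federal(url: str) -> bool:
    # Crude heuristic: keep only .gov/.mil hosts.
    host = url.split("//")[-1].split("/")[0].lower()
    return host.endswith(FEDERAL_SUFFIXES)

def looks_like_staging(url: str) -> bool:
    # Drop URLs that contain common staging/dev markers.
    return any(marker in url.lower() for marker in STAGING_MARKERS)

def build_target_list(source_files):
    """Combine several source CSVs (each with a 'url' column),
    remove duplicates, and drop non-federal and staging URLs."""
    seen = set()
    targets = []
    for path in source_files:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                url = row["url"].strip().lower().rstrip("/")
                if url in seen:
                    continue  # remove duplications across sources
                seen.add(url)
                if looks_federal(url) and not looks_like_staging(url):
                    targets.append(url)
    return targets
```

A review pass on the output (spot-checking fields, counting non-.gov leakage) would then drive the "decide what changes are required" step.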

Links:

Note - be sure to factor in the ones that Dawn from Search.gov shared with us.

Also, weigh EOT2020 over EOT2016.

Notes for Lauren's sheet:

  • addressing the too-long file

FYI, here's a sheet that's getting pretty close to what we need.

I'm working on the ignore list/blacklist creation aspect in #912
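Once the ignore list/blacklist from #912 exists, applying it to the target list could be a simple set-difference pass. A sketch, assuming a plain-text file with one URL per line and `#` comment lines (the file format is an assumption):

```python
def apply_ignore_list(targets, ignore_path):
    """Drop any target URL that appears on the ignore list
    (one URL per line; lines starting with '#' are comments)."""
    with open(ignore_path) as f:
        ignored = {
            line.strip().lower()
            for line in f
            if line.strip() and not line.startswith("#")
        }
    return [url for url in targets if url.lower() not in ignored]
```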