GSA/site-scanning

Create a new website index (target URL list)


We need to reconstitute a process for building the target URL list.

Update: here's the latest iteration of how we see this issue's work:

  • Assemble updated data sources
  • Build an MVP workflow for combining and processing the data
  • Review the MVP's output (the first new pass at the target URL list). Criteria: it contains the necessary fields, includes only federal websites, has duplicates removed, and excludes large batches of staging or non-website URLs (a rough sketch of this step follows the list)
  • Based on this review, decide what changes to the data assembly/processing workflow are required
  • Build the solution (in this case, a blacklist of URLs to exclude) - Gray
  • Implement and document required changes
  • Upload it to become the target URL list used by the production system.
  • Check back later to see how the system did with the new list. Criteria: whether the full target URL list was scanned (as opposed to the system hanging before reaching the end of the list), and whether the proportion of missing or erroneous fields has risen dramatically.
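
To make the combining/filtering step concrete, here's a minimal sketch in Python. It assumes the data sources are plain-text files of candidate URLs, one per line; the file names, the staging keyword list, and the blacklist file are all hypothetical placeholders, not the actual production inputs.

```python
# Minimal sketch of the combine/dedupe/filter step. All file names and
# keyword lists below are hypothetical placeholders, not production values.
from urllib.parse import urlparse

# Hypothetical source files, one candidate URL per line.
SOURCE_FILES = ["pulse_domains.txt", "dap_domains.txt", "other_sources.txt"]
BLACKLIST_FILE = "url_blacklist.txt"  # hypothetical manually curated exclusions

# Substrings that often indicate staging or non-website hosts (assumed list).
STAGING_KEYWORDS = ("staging", "stage", "dev", "test", "demo", "sandbox")

def normalize(url: str) -> str:
    """Lowercase and strip scheme/path so duplicates collapse to one hostname."""
    url = url.strip().lower()
    if "://" not in url:
        url = "https://" + url  # urlparse needs a scheme to find the hostname
    return urlparse(url).hostname or ""

def is_federal(host: str) -> bool:
    """Keep only federal websites; .gov/.mil as a first approximation."""
    return host.endswith(".gov") or host.endswith(".mil")

def looks_like_staging(host: str) -> bool:
    """Drop hosts whose subdomain labels suggest staging/non-website use."""
    return any(label in STAGING_KEYWORDS for label in host.split("."))

def build_target_list() -> list[str]:
    blacklist = set()
    try:
        with open(BLACKLIST_FILE) as f:
            blacklist = {normalize(line) for line in f if line.strip()}
    except FileNotFoundError:
        pass  # no blacklist yet; proceed without exclusions

    candidates = set()  # a set removes duplicates automatically
    for path in SOURCE_FILES:
        with open(path) as f:
            candidates.update(normalize(line) for line in f if line.strip())

    return sorted(
        host for host in candidates
        if host and is_federal(host)
        and not looks_like_staging(host)
        and host not in blacklist
    )

if __name__ == "__main__":
    targets = build_target_list()
    print(f"{len(targets)} target URLs")
```

Note the .gov/.mil check is only a rough proxy for "federal": state and local governments also use .gov, so the real filter would need an authoritative federal domain list.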


Picking this back up, our plan is to:

  • Review the process that we've built so far.
  • Talk through it and decide on any changes we want to make.
  • Re-run it with fresh data.
  • Push a fresh website index to production, even if it includes a lot of extra URLs that we'll want to exclude later (see the check sketch below).
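
As an illustration of the "check back later" criteria from the list above, here's a hedged sketch that reads a scan-results CSV and reports how much of the target list was actually scanned plus the proportion of missing fields. The file names and column conventions are assumptions, not the scanner's actual output format.

```python
# Sketch of the post-release health check. The CSV layout and file names
# are assumptions about the scan output, not the scanner's actual format.
import csv

SCAN_RESULTS = "scan_results.csv"   # hypothetical: one row per scanned URL
TARGET_LIST = "target_urls.txt"     # the list we uploaded to production

def check_scan_health() -> None:
    with open(TARGET_LIST) as f:
        targets = {line.strip() for line in f if line.strip()}

    scanned = set()
    total_fields = 0
    missing_fields = 0
    with open(SCAN_RESULTS, newline="") as f:
        for row in csv.DictReader(f):
            scanned.add(row.get("url", "").strip())
            for value in row.values():
                total_fields += 1
                if not value:  # empty cell treated as a missing field
                    missing_fields += 1

    # Criterion 1: did the scan cover the whole target list, or did it hang?
    coverage = len(scanned & targets) / len(targets) if targets else 0.0
    # Criterion 2: has the proportion of missing fields jumped dramatically?
    missing_rate = missing_fields / total_fields if total_fields else 0.0

    print(f"coverage: {coverage:.1%} of target list scanned")
    print(f"missing fields: {missing_rate:.1%} of all cells empty")

if __name__ == "__main__":
    check_scan_health()
```

Detecting erroneous (as opposed to simply missing) values would need field-specific validation, which this sketch doesn't attempt.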