create a new target URL list
Closed this issue · 4 comments
gbinal commented
We need to reconstitute a process for building the target URL list.
Update: Here's the latest iteration of how we see this issue's work:
- Assemble updated data sources
- Build an MVP workflow for combining and processing the data
- Review the MVP's output (the first new pass at the target URL list). Criteria: it contains the necessary fields, contains only federal websites, has duplicates removed, and has masses of staging or non-website URLs removed
- Based upon this review, decide what changes to the data assembly/processing workflow are required
- Build the solution (in this case a blacklist of URLs) - Gray
- Implement and document required changes
- Upload it to become the target URL list used by the production system.
- Check back later to see how the system did with the new list. Criteria: whether the full target URL list was scanned (as opposed to the system hanging before reaching the end of the list), and whether the proportion of missing or erroneous fields has gone up dramatically.
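As a rough illustration of the combine-and-process step and the review criteria above (only federal websites, no duplicates, blacklisted URLs removed), here is a minimal Python sketch. It is not the workflow that was actually built: the function names, the `FEDERAL_SUFFIXES` rule, and the choice to compare by hostname are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Assumption: a host counts as "federal" if it ends in one of these suffixes.
# The real criterion used for the target URL list may differ.
FEDERAL_SUFFIXES = (".gov", ".mil")


def normalize(url: str) -> str:
    """Reduce a URL or bare hostname to a lowercase hostname without 'www.'."""
    url = url.strip().lower()
    if "://" not in url:
        url = "//" + url  # lets urlsplit parse a bare hostname
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_target_list(sources, blacklist):
    """Merge several URL lists, dropping duplicates, non-federal hosts,
    and anything on the blacklist. Order of first appearance is kept."""
    blocked = {normalize(u) for u in blacklist}
    seen = set()
    result = []
    for source in sources:
        for raw in source:
            host = normalize(raw)
            if not host or host in seen or host in blocked:
                continue
            if not host.endswith(FEDERAL_SUFFIXES):
                continue
            seen.add(host)
            result.append(host)
    return result
```

Called with a few lists of raw URLs and a blacklist, `build_target_list` returns one deduplicated list of hostnames. Staging hosts would need to be on the blacklist to be dropped, since the sketch does not try to detect them automatically.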
gbinal commented
Note - be sure to factor in the ones that Dawn from Search.gov shared with us.
Also, weight EOT2020 more heavily than EOT2016.
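One possible reading of the weighting note is "when a host shows up in both datasets, prefer the EOT2020 record." A minimal Python sketch of that interpretation follows. The `SOURCE_RANK` table and the `(host, source)` record shape are hypothetical and not taken from the actual data.

```python
# Assumption: lower rank wins when a host appears in more than one source.
SOURCE_RANK = {"EOT2020": 0, "EOT2016": 1}


def merge_by_priority(records):
    """records: iterable of (host, source) pairs.
    Returns a dict mapping each host to its preferred source."""
    best = {}
    for host, source in records:
        # Sources not listed in SOURCE_RANK are ranked after all known ones.
        rank = SOURCE_RANK.get(source, len(SOURCE_RANK))
        if host not in best or rank < best[host][0]:
            best[host] = (rank, source)
    return {host: source for host, (rank, source) in best.items()}
```

Other sources, such as the list shared by Search.gov, could be added to the rank table once their relative priority is decided.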
Notes for Lauren's sheet:
- addressing the file being too long
gbinal commented
FYI, here's a sheet that's getting pretty close to what we need.
gbinal commented
Moving to GSA/site-scanning#32