Create a new website index (target URL list)
Closed this issue · 2 comments
gbinal commented
We need to reconstitute a process for building the target URL list.
Update: Here's the latest iteration of how we see the work for this issue:
- Assemble updated data sources
- Build an MVP workflow for combining and processing the data
- Review the MVP's output (the first new pass at the target URL list). Criteria: it contains the necessary fields, contains only federal websites, has duplicates removed, and has large batches of staging or non-website URLs removed
- Based upon this review, decide what changes to the data assembly/processing workflow are required
- Build the solution (in this case a blacklist of URLs) - Gray
- Implement and document required changes
- Upload the new list to become the target URL list used by the production system.
- Check back later to see how the system did with the new list. Criteria: whether the full target URL list was scanned (as opposed to the system hanging before reaching the end of the list), and whether the proportion of missing or erroneous fields has gone up dramatically.
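The combining and filtering steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the workflow that was actually built: the function names, the `.gov`/`.mil` suffix test (a rough stand-in for "federal only"), and the staging labels are all hypothetical, since the issue does not say how any of these are determined.

```python
from urllib.parse import urlsplit

# Hypothetical values for illustration; the issue does not specify them.
FEDERAL_SUFFIXES = (".gov", ".mil")
STAGING_LABELS = {"staging", "stage", "dev", "test"}


def normalize(raw):
    """Reduce a URL or bare hostname to a lowercase hostname without 'www.'."""
    raw = raw.strip().lower()
    if not raw:
        return ""
    if "//" not in raw:
        raw = "//" + raw  # lets urlsplit treat a bare hostname as the netloc
    host = urlsplit(raw).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_federal(host):
    """Rough proxy for 'federal website': the hostname ends in .gov or .mil."""
    return host.endswith(FEDERAL_SUFFIXES)


def looks_like_staging(host):
    """Flag hostnames with a label such as 'staging' or 'dev'."""
    return any(label in STAGING_LABELS for label in host.split("."))


def build_target_list(sources, blacklist):
    """Combine sources, drop duplicates, and apply the review criteria."""
    blocked = {normalize(entry) for entry in blacklist}
    seen = set()
    result = []
    for source in sources:
        for raw in source:
            host = normalize(raw)
            if not host or host in seen:
                continue  # skip blanks and duplicates
            seen.add(host)
            if is_federal(host) and not looks_like_staging(host) and host not in blocked:
                result.append(host)
    return sorted(result)


sources = [
    ["https://www.gsa.gov/", "GSA.gov", "staging.gsa.gov", "example.com"],
    ["http://nasa.gov/about", "old.nasa.gov", ""],
]
print(build_target_list(sources, blacklist=["old.nasa.gov"]))
# prints ['gsa.gov', 'nasa.gov']
```

Keeping the blacklist as a separate input means URLs found during review can be excluded without changing the data sources themselves.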
gbinal commented
Picking this back up, our plan is to:
- Review the process that we've built so far.
- Talk through it and decide on any changes we want to make.
- Re-run it with fresh data.
- Push a fresh website index to production, even if it still contains a lot of extra URLs that we want to exclude later.