medic/cht-docs

Add scripts/CI to check for `404` URLs after a big move

Hugo does a great job of ensuring that pages which link to each other internally don't 404. However, a large move like the one we did recently with forms may 404 a number of inbound links from other sources, or bookmarks folks have. To ensure these don't break, it's nice to generate a list of all known URLs on main, do the big move, and then check that every known URL still resolves or safely redirects.

Two scripts were already written, which we may choose to repurpose, but the final version should likely be:

  • rewritten in node
  • run in CI, blocking a merge if it fails
  • runnable locally so users don't have to wait for CI

Ok! I did some exploratory research and here's what I think the rough structure is - open to input though! For every PR that wants to merge to main, CI will:

  1. build a version of the site based off the branch - see how we do this already for a weekly link check
  2. get every current URL by downloading the site map from production
  3. using a curl equivalent from npm, download every page from the branch build served by the hugo server running in CI
  4. check the response and HTML for each:
    • 200 response - check whether the HTML has an http-equiv="refresh" and, if so, that its target in turn returns a 200 (recursive 'til no meta refresh?)
    • 404 response - note that the page 404s and should instead have an alias (meta refresh)
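
Step 4 could be sketched roughly as follows. This is only an illustration, not the existing scripts: the names `Page`, `FetchPage`, `extractMetaRefresh`, `checkUrl` and the `maxHops` limit are all hypothetical, and the fetcher is passed in so the logic can be tried without a running hugo server.

```typescript
// Result of downloading one URL. A real script would fill this from an HTTP client.
interface Page {
  status: number;
  body: string;
}

// Hypothetical fetcher signature, injected so the checker can be exercised offline.
type FetchPage = (url: string) => Promise<Page>;

interface CheckResult {
  url: string;
  ok: boolean;
  hops: string[]; // every URL visited, in order
  reason?: string; // set only when ok is false
}

// Pull the target out of <meta http-equiv="refresh" content="0; url=...">.
// Returns null when the page has no meta refresh.
function extractMetaRefresh(html: string, baseUrl: string): string | null {
  const tag = html.match(/<meta[^>]*http-equiv=["']?refresh["']?[^>]*>/i);
  if (!tag) return null;
  const target = tag[0].match(/url\s*=\s*['"]?([^'"\s>]+)/i);
  if (!target) return null;
  // Resolve relative targets against the page that contained the tag.
  return new URL(target[1], baseUrl).href;
}

// Follow meta refreshes until a page without one is reached.
// maxHops is an assumed safety limit, not something the issue specifies.
async function checkUrl(
  url: string,
  fetchPage: FetchPage,
  maxHops = 5
): Promise<CheckResult> {
  const hops: string[] = [];
  let current = url;
  while (hops.length < maxHops) {
    if (hops.includes(current)) {
      return { url, ok: false, hops, reason: `redirect loop at ${current}` };
    }
    hops.push(current);
    const page = await fetchPage(current);
    if (page.status !== 200) {
      // For the first hop this is the "needs an alias" case.
      return { url, ok: false, hops, reason: `${current} returned ${page.status}` };
    }
    const next = extractMetaRefresh(page.body, current);
    if (next === null) return { url, ok: true, hops };
    current = next;
  }
  return { url, ok: false, hops, reason: `more than ${maxHops} hops` };
}
```

In CI, any result with `ok` set to false would be collected and the job would exit non-zero, which is what blocks the merge. The loop check and hop limit are there only so a misconfigured alias can't hang the run.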

The site map saves us quite a bit of recursion!
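
As a rough sketch of steps 2 and 3, the site map can be turned into a flat list of URLs to request from the branch build. The function names `sitemapUrls` and `toLocal` are hypothetical, `docs.example.org` is a placeholder host, and the code assumes a single `sitemap.xml` with plain `<loc>` entries (no sitemap index, no XML entities in the URLs). Port 1313 is the hugo server default.

```typescript
// Extract every <loc> entry from the text of a sitemap.xml file.
function sitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const loc = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = loc.exec(xml)) !== null) {
    urls.push(m[1]);
  }
  return urls;
}

// Point a production URL at the branch build by swapping the origin
// and keeping the path and query string.
function toLocal(prodUrl: string, localOrigin: string): string {
  const u = new URL(prodUrl);
  return new URL(u.pathname + u.search, localOrigin).href;
}

// Example with a placeholder host. The real production site map would be
// downloaded first and its text passed in instead.
const sampleXml = `
<urlset>
  <url><loc>https://docs.example.org/</loc></url>
  <url><loc>https://docs.example.org/apps/forms/</loc></url>
</urlset>`;

const localUrls = sitemapUrls(sampleXml).map((u) =>
  toLocal(u, "http://localhost:1313")
);
console.log(localUrls);
```

Because every known production URL is listed up front, the checker only needs to follow meta refreshes from each entry and never has to crawl links.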