edgi-govdata-archiving/web-monitoring-processing

Import script should include client-side redirects

Mr0grog opened this issue · 1 comment

When we can detect client-side redirects, we should include the target of the redirect in our imports.

I’m not sure which of these approaches is right:

  1. Treat them the same as server-side redirects (so we add the redirect to the memento’s history, and only save and hash the URL of the ultimate target).

  2. Add some info about it to source_metadata and then capture the target of the redirect separately. The tricky bit is that we would probably need to handle these separately from the normal imports, because the import has to tell the server to create new pages for redirect targets that don’t already exist. (Normally, when importing known pages, we tell the server not to create new pages, because Wayback searches can generate a lot of false hits for different URLs. That is less of a problem today, though, since we no longer do domain prefix queries, which have bugs in CDX searches; see #550.)

  3. Keep it really simple: just log the redirect to the console as a warning so we can optionally import the redirect targets later in a separate run. This sidesteps issues like the target having a bunch of tracking keys shoved into the query string, which makes it hard to match a redirect to the canonical URL of the target page.
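
A minimal sketch of what the third option might look like, assuming the client-side redirect is a `<meta http-equiv="refresh">` tag (JavaScript redirects are not covered here). The names `find_client_redirect` and `warn_on_client_redirect` are hypothetical and not part of the existing import script:

```python
import logging
from html.parser import HTMLParser
from urllib.parse import urljoin

logger = logging.getLogger(__name__)


class _MetaRefreshParser(HTMLParser):
    """Collects the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get('http-equiv') or '').lower() != 'refresh':
            return
        content = attributes.get('content') or ''
        # Typical form: "0; url=https://example.gov/new-page"
        _, _, rest = content.partition(';')
        key, _, value = rest.strip().partition('=')
        if key.strip().lower() == 'url' and value.strip():
            self.target = value.strip().strip('\'"')


def find_client_redirect(html, base_url):
    """Return the absolute URL of a meta-refresh redirect, or None."""
    parser = _MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    return urljoin(base_url, parser.target)


def warn_on_client_redirect(html, capture_url):
    """Log a warning if the captured HTML redirects somewhere else."""
    target = find_client_redirect(html, capture_url)
    if target:
        logger.warning('Client-side redirect: %s -> %s', capture_url, target)
    return target
```

The warnings could then be collected from the log and fed into a separate import run for the redirect targets.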

I suspect the second approach above is better in terms of data, since it’s most similar to how we handle things today. It does make it tough to link a redirect to its target, though. The third approach might be a simpler version of that for now and could give us something to build from.
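
One possible way to ease the linking problem is to strip tracking keys from a redirect target before trying to match it to a known page URL. This is only a sketch: the list of tracking keys below is an assumption for illustration, and `strip_tracking_params` is a hypothetical helper, not something the project provides:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of tracking keys; a real list would need to be curated.
TRACKING_PREFIXES = ('utm_',)
TRACKING_KEYS = {'fbclid', 'gclid'}


def strip_tracking_params(url):
    """Return `url` with known tracking keys removed from its query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Note that `urlencode` re-encodes the remaining parameters, so the result is meant for matching rather than for requesting the page.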

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.