edgi-govdata-archiving/web-monitoring-processing

Compare URLs with any HTTP status

vbanos opened this issue · 2 comments

Currently, after we try to download two target pages, we check the responses for errors. https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/master/web_monitoring/diffing_server.py#L210
That means that if the pages don't have HTTP status=200, we cannot run diff.

It's possible that we would want to compare two 404 pages. E.g. the Wayback Machine contains captures of URLs with any status. We suggest adding an optional URL param like accept_status=200,403,404.
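
As a rough sketch of the idea, parsing such a parameter could look something like this. The helper names `parse_accept_status` and `is_acceptable` are made up for illustration and do not exist in `diffing_server.py`; the default of 200 just mirrors the current behaviour described above.

```python
def parse_accept_status(raw, default=(200,)):
    """Parse a value like "200,403,404" into a set of integer status codes.

    Hypothetical helper: falls back to `default` when the param is missing
    or empty. Raises ValueError if an entry is not an integer.
    """
    if not raw:
        return set(default)
    # Skip empty entries so a trailing comma does not break parsing.
    return {int(code) for code in raw.split(',') if code.strip()}


def is_acceptable(status_code, accepted):
    """Return True if a response status is in the accepted set."""
    return status_code in accepted
```

With this, a request using `accept_status=200,403,404` would let a 404 response through to the diff, while a request without the param would still only accept 200.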

A valid concern about this feature is that, if an error occurs on the Wayback Machine's side, we might end up comparing its own error pages rather than captured 404 pages.
That's why we also need to check whether the Memento-Datetime HTTP header is present, to make sure we are handling actual captures.

I’ve been thinking about this, and involving the Memento-Datetime header makes it a lot more complex, and I’m not sure there’s a good way to make it abstract. I’m thinking either:

  • Have an option for accept_status as you’re suggesting, and just not worry whether a request error was a memento or an actual failure.

  • Have an option for accept_memento (since Mementos are used by several archives, I think it would be OK to have specialized support for it) that would accept non-2xx response codes only if they have the Memento-Datetime header.
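
To make the second option concrete, here is a minimal sketch of the check. `should_accept` is a hypothetical function, not something in the codebase, and it assumes `headers` is a plain mapping of header names to values.

```python
def should_accept(status_code, headers, accept_memento=False):
    """Decide whether a response is OK to diff.

    Hypothetical sketch: 2xx responses are always accepted. Other status
    codes are accepted only when `accept_memento` is on and the response
    carries a Memento-Datetime header, i.e. it looks like an archived
    capture rather than a failure of the archive itself.
    """
    if 200 <= status_code < 300:
        return True
    if accept_memento:
        # HTTP header names are case-insensitive, so compare in lowercase.
        names = {name.lower() for name in headers}
        return 'memento-datetime' in names
    return False
```

The first option would be the same function without the header check, so the extra cost of memento support is fairly small.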

Given that we have real concerns about making this kind of diff server publicly available (since it can be used as a DDoS amplification vector), I think we probably don’t expect this to be an option that would differ from request to request, so it might make more sense to have it as a configuration option instead of a querystring param.
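
For example, a configuration option could be read once from the environment, something like the sketch below. The variable name `DIFFER_ACCEPT_ANY_STATUS` and the function `accept_any_status` are invented here for illustration; the real setting could be named and loaded differently.

```python
import os


def accept_any_status(environ=os.environ):
    """Read a hypothetical boolean setting from the environment.

    Returns True only for explicit truthy values, so the safer behaviour
    (rejecting error responses) stays the default when the variable is
    unset or has an unrecognised value.
    """
    value = environ.get('DIFFER_ACCEPT_ANY_STATUS', '')
    return value.strip().lower() in ('1', 'true', 'yes')
```

Since the value comes from the server's environment rather than the request, someone calling the diff server can't switch it on themselves.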

Thoughts?

I have implemented a simple conf setting to allow processing of URLs with any status code. #451