Compare URLs with any HTTP status
vbanos opened this issue · 2 comments
Currently, after we try to download two target pages, we check the responses for errors. https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/master/web_monitoring/diffing_server.py#L210
That means that if the pages don't have HTTP status=200, we cannot run a diff.
It's possible that we would want to compare two 404 pages. E.g. the Wayback Machine contains capture URLs with any status. We suggest adding an optional URL param like `accept_status=200,403,404`.
A valid concern about this feature is that, if an error occurs, we might end up comparing Wayback error pages rather than captured 404 pages.
That's why we also need to check that the HTTP header `Memento-Datetime` is present, to make sure we are handling captures.
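A rough sketch of what that check could look like, not taken from `diffing_server.py`: the helper names `parse_accept_status` and `is_usable_capture` are hypothetical, and requiring `Memento-Datetime` only on non-2xx responses is an assumption about how the two checks would combine.

```python
def parse_accept_status(raw, default=(200,)):
    """Parse a value like "200,403,404" into a set of integer status codes."""
    if not raw:
        return set(default)
    codes = set()
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue
        code = int(part)  # raises ValueError on non-numeric input
        if not 100 <= code <= 599:
            raise ValueError(f"not an HTTP status code: {code}")
        codes.add(code)
    return codes or set(default)


def is_usable_capture(status, headers, accepted):
    """Return True if the response may be diffed.

    Assumption: a 2xx response is always fine when its status is accepted;
    any other accepted status must also carry a Memento-Datetime header, so
    that an archive's own error page is not mistaken for a captured one.
    """
    if status not in accepted:
        return False
    if 200 <= status < 300:
        return True
    # HTTP header names are case-insensitive, so normalize before checking.
    names = {name.lower() for name in headers}
    return "memento-datetime" in names
```

With `accept_status=200,403,404`, a 404 that carries `Memento-Datetime` would be diffed, while a bare 404 or any 500 would still be rejected.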
I’ve been thinking about this, and involving the `Memento-Datetime` header makes it a lot more complex, and I’m not sure there’s a good way to make it abstract. I’m thinking either:
- Have an option for `accept_status` as you’re suggesting, and just not worry whether a request error was a memento or an actual failure.
- Have an option for `accept_memento` (since Mementos are used by several archives, I think it would be OK to have specialized support for it) that would accept non-2xx response codes only if they have the `Memento-Datetime` header.
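To make the difference between the two options concrete, here is a minimal sketch. Both function names are hypothetical and neither exists in the project; they only illustrate the decision each option would make for a single response.

```python
def accept_by_status(status, accept_status=frozenset({200})):
    """Option 1: accept any response whose status is in the allowed set.

    No attempt is made to tell a captured error page from a real failure.
    """
    return status in accept_status


def accept_by_memento(status, headers):
    """Option 2: accept 2xx responses as usual, and accept non-2xx responses
    only when they carry a Memento-Datetime header."""
    if 200 <= status < 300:
        return True
    # Header names are case-insensitive.
    return any(name.lower() == "memento-datetime" for name in headers)
```

Under option 1 a plain 404 from any server passes once 404 is listed. Under option 2 it passes only if the archive marked it as a memento.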
Given that we have real concerns about making this kind of diff server publicly available (since it can be used as a DDoS amplification vector), I don’t think we expect this option to differ from request to request, so it might make more sense to have it as a configuration option instead of a querystring param.
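As a sketch of the configuration approach, the setting could be read once from the environment when the server starts. The variable name `ACCEPT_MEMENTO_ERRORS` and the function `load_accept_memento` are made up for illustration, and the project may well prefer a different configuration mechanism.

```python
import os


def load_accept_memento(environ=None):
    """Read a server-wide flag instead of a per-request querystring param.

    ACCEPT_MEMENTO_ERRORS is a hypothetical variable name; unset or any
    unrecognized value leaves the feature off.
    """
    if environ is None:
        environ = os.environ
    raw = environ.get("ACCEPT_MEMENTO_ERRORS", "")
    return raw.strip().lower() in {"1", "true", "yes"}
```

Defaulting to off keeps the current behavior for anyone who does not opt in.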
Thoughts?