Figure out how to handle bad gzipped content from Wayback
Mr0grog opened this issue · 6 comments
From this Sentry error: https://sentry.io/environmental-data-governance-/diffing-server/issues/755625052/activity/
This is being caused by an error on Wayback’s side. See https://internetarchive.slack.com/archives/C92PYHLCE/p1541621325015700
The original URL that was scraped here is http://cwcgom.aoml.noaa.gov/erddap/griddap/miamiacidification.graph, which returns gzipped content (with a proper `Content-Encoding: gzip` header). (The page in our DB is https://api.monitoring.envirodatagov.org/api/v0/pages/5079d1eb-a7f2-4209-975c-c585d3ab6a74 and an example of a bad response from Wayback is http://web.archive.org/web/20181023233237id_/http://cwcgom.aoml.noaa.gov/erddap/griddap/miamiacidification.graph.)
When Wayback serves up a capture, though, it sends a `content-encoding:` header (note that it is both lower-case and blank). It might also be serving a proper `Content-Encoding: gzip` header later; we’d have to dig in more to see (I’m getting different results in curl and Chrome, but neither can figure out that this is supposed to be gzipped content and decode it).
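For the record, a minimal sketch of how to inspect the raw headers with requests (assuming the capture above still misbehaves; urllib3’s header dict keeps duplicate fields, so both the blank and the real encoding header should show up if both are being sent):

```python
import requests

url = ('http://web.archive.org/web/20181023233237id_/'
       'http://cwcgom.aoml.noaa.gov/erddap/griddap/miamiacidification.graph')
response = requests.get(url, allow_redirects=False)
# response.raw.headers is urllib3's HTTPHeaderDict, which keeps every
# header line it received, including duplicates of the same field.
print(response.raw.headers.items())
```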
I’d guess this is systemic and affecting a lot more things. Needs some feedback from Wayback folks on whether this is a known issue and how best to address it (or if they can fix it quickly and we should just leave it be).
Wayback folks are unsure about a timeline for a fix. In the meantime, their suggestion is to stop accepting gzip encoding and load everything unencoded. So when making memento requests, instead of:

```python
response = self.session.request('GET', url, allow_redirects=False)
```

do:

```python
response = self.session.request('GET', url, allow_redirects=False, headers={'Accept-Encoding': 'identity'})
```
Need to test and make sure that’s right. It appears to work in cURL.
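If that holds up, a session-level variant of the same idea might be cleaner (a sketch, reusing the bad capture URL above), so every memento request gets the header without repeating it per call:

```python
import requests

session = requests.Session()
# Ask for unencoded bodies on every request so Wayback's broken
# content-encoding headers never come into play.
session.headers['Accept-Encoding'] = 'identity'

response = session.get(
    'http://web.archive.org/web/20181023233237id_/'
    'http://cwcgom.aoml.noaa.gov/erddap/griddap/miamiacidification.graph',
    allow_redirects=False)
```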
Update: we actually aren’t having problems with this Wayback bug in this project because we have been accidentally squashing the default `Accept-Encoding: gzip, deflate` header when we set the user-agent! It turns out the problem was happening over in -db, where the `Archiver` module re-downloads the content to store in S3.
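For context, the accidental squashing looks something like this (a sketch; the user-agent string here is made up):

```python
import requests

# Assigning a whole new dict replaces ALL of requests' default headers,
# including 'Accept-Encoding: gzip, deflate', so responses come back
# unencoded and the Wayback bug never bites:
squashed = requests.Session()
squashed.headers = {'User-Agent': 'web-monitoring (hypothetical)'}

# Updating in place keeps the defaults, which turns gzip back on
# (and exposes the bug):
updated = requests.Session()
updated.headers.update({'User-Agent': 'web-monitoring (hypothetical)'})
```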
That said, with the amount of data we are pulling across (we have to download and hash every prospective version), I kind of want to work around this here so we can turn on gzipping. My terrifyingly ugly fix so far:
```python
from urllib3._collections import HTTPHeaderDict

_header_dict_init = HTTPHeaderDict.__init__

def _new_header_dict_init(self, headers=None, **kwargs):
    # When urllib3 parses a response, `headers` arrives as a list of
    # (name, value) tuples, so the blank header and the good one are
    # still distinguishable here.
    if headers is not None:
        if ('content-encoding', '') in headers and ('Content-Encoding', 'gzip') in headers:
            # The comparison is case-sensitive, so this drops only the
            # lower-case blank header and keeps 'Content-Encoding: gzip'.
            headers = [item for item in headers if item[0] != 'content-encoding']
    return _header_dict_init(self, headers, **kwargs)

HTTPHeaderDict.__init__ = _new_header_dict_init
```
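Once the patch has been applied (before the response gets parsed), a plain request against the bad capture above should come back decoded. A quick hypothetical smoke test:

```python
import requests

# Assumes the HTTPHeaderDict patch above has already run in this process.
response = requests.get(
    'http://web.archive.org/web/20181023233237id_/'
    'http://cwcgom.aoml.noaa.gov/erddap/griddap/miamiacidification.graph')
# With the blank header gone, urllib3 sees the real 'gzip' value and
# decompresses the body as usual.
print(response.headers.get('Content-Encoding'))  # expect: 'gzip'
print(response.text[:100])  # expect: readable text, not gzip bytes
```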
Pretty ugly, but concise and reasonably well targeted. There’s a slightly cleaner place to do this: you can set `HTTPConnectionPool.ResponseCls` to your own subclass of urllib3’s `Response` class and clean up the underlying httplib headers object before it gets re-parsed… except I couldn’t figure out how to modify it; none of the changes I made took. So I settled on the above. For reference, if anybody can see an obvious mistake, here’s what I tried there:
```python
# This doesn't work -- my edits to the httplib_response.msg object don't take
# like the docs imply they should
from urllib3.connectionpool import HTTPConnectionPool

ActualResponse = HTTPConnectionPool.ResponseCls

class WaybackResponse(ActualResponse):
    @classmethod
    def from_httplib(cls, httplib_response, **response_kwargs):
        headers = httplib_response.msg
        print(f'headers class: {headers.__class__}')
        # Both of these get '' -- the message object's get() is
        # case-insensitive and returns the first matching header,
        # which is the blank one...
        print(f'Lower encoding: {headers.get("content-encoding", "not found")}')
        print(f'Upper encoding: {headers.get("Content-Encoding", "not found")}')
        # ...so obviously this condition is never going to be true
        if headers.get('content-encoding', None) == '' and headers.get('Content-Encoding', None) == 'gzip':
            del headers['content-encoding']
        # But this doesn't take effect either
        del headers['content-type']
        return ActualResponse.from_httplib(httplib_response, **response_kwargs)
```
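One guess at the mistake: the message object’s dict-style API is case-insensitive, so there’s no way to target just the blank header through it. A sketch of how the same subclass approach might work by editing the raw header pairs instead, assuming Python 3, where `httplib_response.msg` keeps its raw `(name, value)` pairs in a private `_headers` list (an implementation detail, so this is untested speculation):

```python
from urllib3.connectionpool import HTTPConnectionPool

ActualResponse = HTTPConnectionPool.ResponseCls

class WaybackResponse(ActualResponse):
    @classmethod
    def from_httplib(cls, httplib_response, **response_kwargs):
        # Work on the raw (name, value) pairs, which preserve case and
        # duplicates, rather than going through the case-insensitive
        # get()/del API.
        pairs = httplib_response.msg._headers  # private attribute!
        if ('content-encoding', '') in pairs and ('Content-Encoding', 'gzip') in pairs:
            httplib_response.msg._headers = [
                pair for pair in pairs if pair != ('content-encoding', '')
            ]
        return ActualResponse.from_httplib(httplib_response, **response_kwargs)

# Register the subclass so urllib3 builds responses with it.
HTTPConnectionPool.ResponseCls = WaybackResponse
```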
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
This is already solved in an in-progress PR that just needs to be cleaned up and merged.