edgi-govdata-archiving/web-monitoring-processing

Figure out how to handle bad gzipped content from Wayback

Mr0grog opened this issue · 6 comments

From this Sentry error: https://sentry.io/environmental-data-governance-/diffing-server/issues/755625052/activity/

This is being caused by an error on Wayback’s side. See https://internetarchive.slack.com/archives/C92PYHLCE/p1541621325015700

The original URL that was scraped here is http://cwcgom.aoml.noaa.gov/erddap/griddap/miamiacidification.graph, which returns gzipped content (with a proper Content-Encoding: gzip header). (The page in our DB is: https://api.monitoring.envirodatagov.org/api/v0/pages/5079d1eb-a7f2-4209-975c-c585d3ab6a74 and an example of a bad response from Wayback is: http://web.archive.org/web/20181023233237id_/http://cwcgom.aoml.noaa.gov/erddap/griddap/miamiacidification.graph)

When Wayback serves up a capture, though, it sends a content-encoding: header (note that it is both lower-case and has a blank value). It might also be serving a proper Content-Encoding: gzip header later; we’d have to dig in more to see (I’m getting different results in curl and Chrome, but neither can figure out that this is supposed to be gzipped content and decode it).

I’d guess this is systemic and affecting a lot more things. Needs some feedback from Wayback folks on whether this is a known issue and how best to address it (or if they can fix it quickly and we should just leave it be).

Wayback folks are unsure about a timeline for a fix. In the meantime, their suggestion is to stop accepting gzip encoding and load everything normally. So when making memento requests, instead of:

response = self.session.request('GET', url, allow_redirects=False)

do:

response = self.session.request('GET', url, allow_redirects=False, headers={'Accept-Encoding': 'identity'})

Need to test and make sure that’s right. It appears to work in cURL.
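If that pans out, a variant of the same fix (sketched here assuming a shared requests.Session, which isn’t shown in the snippet above) is to set the header once on the session instead of on every call:

```python
import requests

# Assumed setup: the memento client makes all its requests through
# one shared session.
session = requests.Session()

# Override the session default ('gzip, deflate') once, so every request
# made through this session asks Wayback for unencoded bytes.
session.headers['Accept-Encoding'] = 'identity'
```

A per-request headers= argument still overrides the session value, so individual calls can opt back into gzip if needed.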

Update: we actually aren’t hitting this Wayback bug in this project, because we have been accidentally squashing the default Accept-Encoding: gzip, deflate header when we set the user-agent! It turns out the problem was happening over in -db, where the Archiver module re-downloads the content to store in S3.
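For reference, the squashing happens because assigning a whole new dict to a requests session’s headers discards the library’s defaults, while update() merges into them. A small illustration (the user-agent string here is made up):

```python
import requests

# Wrong: replacing the headers dict drops requests' defaults,
# including 'Accept-Encoding: gzip, deflate'.
squashed = requests.Session()
squashed.headers = {'User-Agent': 'web-monitoring (hypothetical)'}

# Right: update() keeps the defaults and only sets/overrides User-Agent.
merged = requests.Session()
merged.headers.update({'User-Agent': 'web-monitoring (hypothetical)'})
```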

That said, with the amount of data we are pulling across (we have to download and hash every prospective version), I kind of want to work around this here so we can turn on gzipping. My terrifyingly ugly fix so far:

from urllib3._collections import HTTPHeaderDict

# Keep a reference to the original initializer so we can delegate to it.
_header_dict_init = HTTPHeaderDict.__init__

def _new_header_dict_init(self, headers=None, **kwargs):
    # If the raw header list contains both Wayback's bogus blank
    # 'content-encoding' and a real 'Content-Encoding: gzip', drop the
    # blank one before urllib3 parses the headers.
    if headers is not None:
        if ('content-encoding', '') in headers and ('Content-Encoding', 'gzip') in headers:
            headers = [item for item in headers if item[0] != 'content-encoding']
    return _header_dict_init(self, headers, **kwargs)

# Monkey-patch the constructor globally.
HTTPHeaderDict.__init__ = _new_header_dict_init

Pretty ugly, but concise and reasonably well targeted. There’s a slightly cleaner place to do this — you can set HTTPConnectionPool.ResponseCls to your own subclass of urllib3’s Response class and clean up the underlying httplib headers object before it gets re-parsed… except I couldn’t figure out how to modify it — none of the changes I made took. So I settled on the above. For reference, if anybody can see an obvious mistake, here’s what I tried there:
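The filtering logic in the patch above can be pulled out and sanity-checked on its own; the header list below is fabricated to mimic Wayback’s response:

```python
def strip_bogus_encoding(headers):
    # Drop the blank lower-case 'content-encoding' entry, but only when
    # a real 'Content-Encoding: gzip' is also present; otherwise leave
    # the header list untouched.
    if ('content-encoding', '') in headers and ('Content-Encoding', 'gzip') in headers:
        return [item for item in headers if item[0] != 'content-encoding']
    return headers

wayback_headers = [
    ('content-encoding', ''),      # the bogus blank header Wayback adds
    ('Content-Encoding', 'gzip'),  # the real one
    ('Content-Type', 'text/html'),
]
```

Note the case-sensitive comparison: the list filter removes only the lower-case 'content-encoding' entries and leaves the properly cased one alone.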

# This doesn't work -- my edits to the httplib_response.msg object don’t take
# like the docs imply they should
from urllib3.connectionpool import HTTPConnectionPool
ActualResponse = HTTPConnectionPool.ResponseCls

class WaybackResponse(ActualResponse):
    @classmethod
    def from_httplib(cls, httplib_response, **response_kwargs):
        headers = httplib_response.msg
        print(f'headers class: {headers.__class__}')
        # Both of these get ''
        print(f'Lower encoding: {headers.get("content-encoding", "not found")}')
        print(f'Upper encoding: {headers.get("Content-Encoding", "not found")}')
        # ...so obviously this is never going to work
        if headers.get('content-encoding', None) == '' and headers.get('Content-Encoding', None) == 'gzip':
            del headers['content-encoding']
        # But this doesn’t work either
        del headers['content-type']
        return ActualResponse.from_httplib(httplib_response, **response_kwargs)

This is fixed in 0e896ee on the 86-import-known-db-pages-from-ia branch (PR #174).

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

This is already solved in an in-progress PR that just needs to be cleaned up and merged.

Fixed as part of #174, merged in b362860.