edgi-govdata-archiving/web-monitoring-processing

_decode_body fails if body is empty

Mr0grog opened this issue · 0 comments

Sentry logged a fun divide by zero error today: https://sentry.io/environmental-data-governance-/diffing-server/issues/755564118/

Basically, in the diffing server’s _decode_body method, we try to determine if a body might have been binary by computing the ratio of undecodable (replacement) characters to the length of the decoded text. However, if the body was 0-length, that division raises a ZeroDivisionError (see line 318):

def _decode_body(response, name, raise_if_binary=True):
    encoding = _extract_encoding(response.headers, response.body) or 'UTF-8'
    text = response.body.decode(encoding, errors='replace')
    # If a significantly large portion of the document was totally undecodable,
    # it's likely this wasn't text at all, but binary data.
    if raise_if_binary and text.count('\ufffd') / len(text) > 0.25:
        raise UndecodableContentError(f'The response body of `{name}` could not be decoded as {encoding}.')
    return text

This should be a pretty straightforward fix and might pair well with #310.