edgi-govdata-archiving/web-monitoring-processing

Protect `html_text_dmp` and `html_source_dmp` with content-type checking and sniffing

Mr0grog opened this issue · 0 comments

The html_token diff type currently checks the content-type header and optionally sniffs to make sure it’s reasonably to try and diff its inputs:

raise_if_not_diffable_html(
a_text,
b_text,
a_headers,
b_headers,
content_type_options)

The html_text_dmp and html_source_dmp, which can both go wild and consume all server resources, need similar checks and heuristics:

  • For html_text_dmp, the semantics can be similar to html_token: we want to make sure what we’ve got seems like HTML.
  • For html_source_dmp, the semantics are different — we just want to make sure it doesn’t seem like binary content (e.g. PDF, image, zip file, Excel file, etc.).

Side note: would this be a useful, generalized binary heuristic?

probably_binary = decoded_string.count('\ufffd') / len(decoded_string) > 0.25

Backstory: This afternoon, after publishing edgi-govdata-archiving/web-monitoring-db#406, I queued up a day’s worth of changes to analyze. Unfortunately, this wound up causing the diffing server to do that terrible thing it does and totally lock up the VM it’s running on (see #154), only now that we are using Kubernetes, it took down every pod on its node :(

Happily, Sentry gets us some pretty good diagnostic info: https://sentry.io/environmental-data-governance-/db-prod/issues/722661459/

It turns out what happened was that some of the (earliest ↔︎ current) diffs involved version pairs where one was a PDF and the other was not. When the earliest version was a PDF that came from Versionista, and when that version was old enough not to have its content-type explicitly specified, it ran across the lame fall-back logic I wrote that just assumes it might be HTML and goes for it.

When the differ tries to diff a big binary blob, it’s a bad time. Asking it to do so used to be reasonably safe, though: if the content failed to decode as text, we’d raise an exception and stop. But we removed that and made decoding more lenient in #232. I thought we were still safe because we also do sniffing (added in #155), but that’s only on the html_token diff.

One important fix here is to make it less likely for these diffs to even try to do their work when they receive obviously binary data, hence this issue. (We also need resource limits and other things, but we should still sniff to avoid running amok in the first place.)

/cc @danielballan @jsnshrmn