edgi-govdata-archiving/web-monitoring-processing

DiffMatchPatch can’t handle null terminators

Mr0grog opened this issue · 0 comments

This is from Sentry: https://sentry.io/environmental-data-governance-/diffing-server/issues/755613537/events/35761652755/

It turns out we have a reasonable amount of malformed content with null bytes in the middle of it. Unfortunately, our super-fast C implementation can’t handle that (not too big of a surprise, really). It throws a ValueError in differs.compute_dmp_diff():

def compute_dmp_diff(a_text, b_text, timelimit=4):
    if (isinstance(a_text, str) and isinstance(b_text, str)):
        changes = diff(a_text, b_text, checklines=False, timelimit=timelimit, cleanup_semantic=True, counts_only=False)
    elif (isinstance(a_text, bytes) and isinstance(b_text, bytes)):
        changes = diff_bytes(a_text, b_text, checklines=False, timelimit=timelimit, cleanup_semantic=True,
                             counts_only=False)
    else:
        raise TypeError("Both the texts should be either of type 'str' or 'bytes'.")

We should probably check for null terminators and replace them with something:

  • The unicode replacement character? (what we use for decoding errors)
  • The unicode null symbol? (fun, but too cute/too indecipherable for many users?)
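
For illustration, here is a rough sketch of what that replacement step might look like. Nothing here exists in the codebase yet: `replace_nulls`, `REPLACEMENT_CHARACTER`, and `NULL_SYMBOL` are hypothetical names, and the `bytes` branch assumes the content is UTF-8 text, which may not hold for everything we diff.

```python
# Hypothetical constants for the two candidate stand-ins discussed above.
REPLACEMENT_CHARACTER = '\ufffd'  # U+FFFD, also used for decoding errors
NULL_SYMBOL = '\u2400'            # U+2400, the visible "NUL" control picture


def replace_nulls(text, replacement=REPLACEMENT_CHARACTER):
    """Hypothetical helper: swap null characters/bytes for a visible stand-in."""
    if isinstance(text, str):
        return text.replace('\x00', replacement)
    if isinstance(text, bytes):
        # Assumption: the bytes are UTF-8 text, so the stand-in is encoded
        # as UTF-8 as well.
        return text.replace(b'\x00', replacement.encode('utf-8'))
    raise TypeError("Text should be of type 'str' or 'bytes'.")


cleaned = replace_nulls('before\x00after')
cleaned_symbol = replace_nulls('before\x00after', NULL_SYMBOL)
cleaned_bytes = replace_nulls(b'before\x00after')
```

Something like this could be called on both `a_text` and `b_text` at the top of `compute_dmp_diff()` before they reach the C diff. One thing to keep in mind is that the diff output would then show the stand-in character rather than the original null, so offsets in the `bytes` case shift by the extra encoded bytes.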