edgi-govdata-archiving/web-monitoring-processing

Support invalid encoding `iso-8559-1`

Closed this issue · 2 comments

In https://sentry.io/environmental-data-governance-/diffing-server/issues/755653660/, we have some content that claims to be encoded as iso-8559-1. That’s a typo for iso-8859-1 (i.e. latin-1, which is a superset of ASCII).

Since iso-8559-1 is actually a clothing size standard, I think we can safely map this to the correct encoding :)
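A minimal sketch of that mapping, assuming the declared encoding arrives as a plain string. `ENCODING_TYPOS` and `normalize_encoding` are hypothetical names for illustration, not existing functions in this repo:

```python
import codecs

# Hypothetical table of known-bad labels and the encodings they meant.
ENCODING_TYPOS = {
    "iso-8559-1": "iso-8859-1",
}


def normalize_encoding(name):
    """Return the codec name Python uses, correcting known typos first."""
    cleaned = name.strip().lower()
    cleaned = ENCODING_TYPOS.get(cleaned, cleaned)
    # codecs.lookup raises LookupError for names Python does not recognize,
    # so labels that are still unknown surface to the caller.
    return codecs.lookup(cleaned).name
```

Keeping the typos in an explicit table means a new bad label only needs one new entry, and anything not listed still fails loudly.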

Side note: @danielballan @jsnshrmn any thoughts on whether we should fall back to trying ASCII (or maybe UTF-8) for unknown encodings? It would have succeeded in this particular case.

That said, I think we’d still want to log the invalid encoding if we fell back to something else (so we can see in the logs if there’s an encoding we should actually be supporting, as opposed to typos like this).
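Roughly what that fallback could look like, as a sketch only. `decode_with_fallback` and its `fallbacks` parameter are made-up names, and the order (ASCII, then UTF-8) is just the one floated above:

```python
import logging

logger = logging.getLogger(__name__)


def decode_with_fallback(raw, declared, fallbacks=("ascii", "utf-8")):
    """Decode bytes with the declared encoding, falling back if it is unknown."""
    try:
        return raw.decode(declared)
    except LookupError:
        # Python does not know this encoding name. Log it so real encodings
        # we should support can be told apart from typos in the logs.
        logger.warning(
            "Unknown encoding %r; trying fallbacks %s", declared, fallbacks
        )

    for candidate in fallbacks:
        try:
            return raw.decode(candidate)
        except UnicodeDecodeError:
            continue

    raise UnicodeError(f"Could not decode content declared as {declared!r}")
```

Note that this only falls back when the encoding name itself is unknown. If the name is valid but the bytes do not match it, the `UnicodeDecodeError` still propagates.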

👍 I like the idea of attempting ASCII decoding as a last-ditch, dumb thing to try (when we're sure something isn't binary) before giving up.