UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9... invalid start byte
Closed this issue · 2 comments
With pyzor-1.0.0, we're seeing a few crashes every day through SpamAssassin/amavisd-new. I tracked one of them down to the following error:
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.5/pyzor", line 408, in <module>
    main()
  File "/usr/lib/python-exec/python3.5/pyzor", line 152, in main
    if not dispatch(client, servers, config):
  File "/usr/lib/python-exec/python3.5/pyzor", line 237, in check
    for digested in get_input_handler(style):
  File "/usr/lib/python-exec/python3.5/pyzor", line 174, in _get_input_msg
    msg = email.message_from_file(sys.stdin)
  File "/usr/lib64/python3.5/email/__init__.py", line 54, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "/usr/lib64/python3.5/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib64/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 6527: invalid start byte
In Emacs, that character looks like the copyright symbol. The text/plain portion of the killer message (where the bad character lives) is supposed to be iso-8859-1, which encodes the copyright symbol as the single byte 0xA9 (U+00A9), so that checks out.
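To make the byte-level claim concrete, here is a small sketch (not taken from pyzor; the sample string is made up) showing that 0xA9 decodes cleanly as iso-8859-1 but is rejected by the UTF-8 codec with the same "invalid start byte" reason as in the traceback:

```python
# Made-up sample bytes containing 0xA9, the ISO-8859-1 copyright sign.
raw = b"Copyright \xa9 2018"

# ISO-8859-1 maps every byte to one code point, so 0xA9 becomes U+00A9.
latin1_text = raw.decode("iso-8859-1")

# In UTF-8, 0xA9 is a continuation byte and cannot begin a character.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(latin1_text)
print(utf8_error)
```

The failing position reported here is the offset of the byte within the sample, just as position 6527 in the traceback is an offset into the data read from stdin.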
I'm too tired at the moment to figure out why pyzor is trying to decode iso-8859-1 as utf-8, but maybe it will be obvious to someone else (if it's even pyzor's fault).
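This is actually due to Python not being able to decode stdin: sys.stdin is a text stream, so the raw message bytes are decoded as UTF-8 before the email parser ever sees them. The following patch seems to work on the one problematic message I have... I'll do more testing when I'm able: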
diff --git a/scripts/pyzor b/scripts/pyzor
index 567a7f9..92df716 100755
--- a/scripts/pyzor
+++ b/scripts/pyzor
@@ -171,7 +171,11 @@ def _get_input_digests(dummy):
 def _get_input_msg(digester):
-    msg = email.message_from_file(sys.stdin)
+    maintype = 'application'
+    subtype = 'octet-stream'
+    ctype = maintype + '/' + subtype
+    msg = email.message.EmailMessage()
+    msg.set_content(get_binary_stdin().read(), maintype, subtype)
     digested = digester(msg).value
     yield digested
For whatever reason, changing email.message_from_file to email.message_from_bytes directly does not work.
Never mind all that, I screwed up the test. I think it may be as simple as:
diff --git a/scripts/pyzor b/scripts/pyzor
index 567a7f9..86c6f7d 100755
--- a/scripts/pyzor
+++ b/scripts/pyzor
@@ -171,7 +171,7 @@ def _get_input_digests(dummy):
 def _get_input_msg(digester):
-    msg = email.message_from_file(sys.stdin)
+    msg = email.message_from_bytes(get_binary_stdin().read())
     digested = digester(msg).value
     yield digested
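For anyone who wants to reproduce this without a live mail flow, here is a rough sketch of why the one-line change helps. It simulates stdin with in-memory streams and assumes that get_binary_stdin() returns a binary file object along the lines of sys.stdin.buffer on Python 3 (the helper's body is not shown in this thread, so that is an assumption). The sample message and the two parse_* helper names are made up for illustration.

```python
import email
import io

# Made-up message with an ISO-8859-1 body; 0xA9 is the copyright sign.
RAW = (
    b"Subject: test\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Copyright \xa9 Example\r\n"
)


def parse_as_text(raw):
    """Mimic email.message_from_file(sys.stdin) when stdin is UTF-8 text."""
    fake_stdin = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
    return email.message_from_file(fake_stdin)


def parse_as_bytes(raw):
    """Mimic email.message_from_bytes(<binary stdin>.read())."""
    fake_binary_stdin = io.BytesIO(raw)  # stand-in for a binary stdin
    return email.message_from_bytes(fake_binary_stdin.read())


# The text path fails while reading, before any MIME parsing happens.
try:
    parse_as_text(RAW)
    text_failed = False
except UnicodeDecodeError:
    text_failed = True

# The bytes path parses, and the declared charset can then be applied.
msg = parse_as_bytes(RAW)
body = msg.get_payload(decode=True).decode(msg.get_content_charset())

print(text_failed)
print(body.strip())
```

The point of the bytes path is that the parser gets to read the Content-Type header and leave the body bytes alone, instead of the whole stream being forced through one codec up front.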