SpamExperts/pyzor

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9... invalid start byte

With pyzor-1.0.0, we're seeing a few crashes every day through SpamAssassin/amavisd-new. I tracked one of them down to the following error:

Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.5/pyzor", line 408, in <module>
    main()
  File "/usr/lib/python-exec/python3.5/pyzor", line 152, in main
    if not dispatch(client, servers, config):
  File "/usr/lib/python-exec/python3.5/pyzor", line 237, in check
    for digested in get_input_handler(style):
  File "/usr/lib/python-exec/python3.5/pyzor", line 174, in _get_input_msg
    msg = email.message_from_file(sys.stdin)
  File "/usr/lib64/python3.5/email/__init__.py", line 54, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "/usr/lib64/python3.5/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib64/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 6527: invalid start byte

In Emacs, that character looks like the copyright symbol. The text/plain portion of the killer message (where the bad character lives) is supposed to be iso-8859-1, which encodes the copyright symbol as the single byte 0xA9, so that checks out.
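A minimal sketch of why that byte trips the utf-8 codec, using a made-up byte string rather than the real message: in UTF-8, 0xA9 is only valid as a continuation byte, so on its own it cannot start a character, while iso-8859-1 maps it straight to the copyright symbol.

```python
# Hypothetical sample bytes; the real message is not shown in this report.
raw = b"Copyright \xa9 2016"

# Decoding as UTF-8 fails because 0xA9 cannot begin a UTF-8 sequence.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

print(error.reason)  # the same reason string as in the traceback
print(error.start)   # offset of the offending byte within raw

# Decoding with the charset the message part claims works fine.
text = raw.decode("iso-8859-1")
print(text)
```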

I'm too tired at the moment to figure out why pyzor is trying to decode iso-8859-1 as utf-8, but maybe it will be obvious to someone else (if it's even pyzor's fault).

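This is actually due to Python not being able to decode stdin: sys.stdin is a text stream, so the raw bytes get decoded (as utf-8 here, per the traceback) before the email parser ever sees the message's own charset declaration. The following patch seems to work on the one problematic message I have... I'll do more testing when I'm able: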

diff --git a/scripts/pyzor b/scripts/pyzor
index 567a7f9..92df716 100755
--- a/scripts/pyzor
+++ b/scripts/pyzor
@@ -171,7 +171,11 @@ def _get_input_digests(dummy):


 def _get_input_msg(digester):
-    msg = email.message_from_file(sys.stdin)
+    maintype = 'application'
+    subtype = 'octet-stream'
+    ctype = maintype + '/' + subtype
+    msg = email.message.EmailMessage()
+    msg.set_content(get_binary_stdin().read(), maintype, subtype)
     digested = digester(msg).value
     yield digested

For whatever reason, changing email.message_from_file to email.message_from_bytes directly does not work.

Never mind all that; I screwed up the test. I think it may be as simple as:

diff --git a/scripts/pyzor b/scripts/pyzor
index 567a7f9..86c6f7d 100755
--- a/scripts/pyzor
+++ b/scripts/pyzor
@@ -171,7 +171,7 @@ def _get_input_digests(dummy):


 def _get_input_msg(digester):
-    msg = email.message_from_file(sys.stdin)
+    msg = email.message_from_bytes(get_binary_stdin().read())
     digested = digester(msg).value
     yield digested
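For anyone who wants to check the idea outside of pyzor, here is a rough sketch of the same approach. It assumes get_binary_stdin() returns a binary stream along the lines of sys.stdin.buffer (its implementation isn't shown here), and uses io.BytesIO with a made-up message as a stand-in. Parsing from bytes lets the email package keep the 8-bit payload intact, so it can be decoded later with the charset the part declares.

```python
import email
import io

# Hypothetical message; headers and body are invented for illustration.
raw = (
    b"From: sender@example.com\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Copyright \xa9 2016\r\n"
)

# Stand-in for a binary stdin stream (assumption: behaves like sys.stdin.buffer).
binary_stdin = io.BytesIO(raw)

# Parse from bytes, so no locale-based decoding happens up front.
msg = email.message_from_bytes(binary_stdin.read())

# Recover the original payload bytes and decode with the declared charset.
body = msg.get_payload(decode=True)
charset = msg.get_content_charset()
text = body.decode(charset)
print(charset, text.strip())
```

Nothing in this path depends on the locale encoding, which is presumably why the one-line patch avoids the crash.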