SpamExperts/pyzor

Non-breaking spaces

birkett83 opened this issue · 0 comments

I recently received a spam message containing a non-breaking space (encoded as =C2=A0 in quoted-printable UTF-8, in case that is relevant). When running pyzor predigest, the non-breaking space is kept in the predigest output. I have no idea whether spammers actually do this, but they could randomly replace spaces with non-breaking spaces before sending mail, generating a different fingerprint each time and evading detection.
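For reference, a minimal sketch of how that encoding decodes, using only the standard library. The sample body is made up for illustration and is not taken from the actual message:

```python
import quopri

# Hypothetical quoted-printable body; =C2=A0 is the UTF-8 encoding of U+00A0.
raw = b"Buy=C2=A0now"

# Undo the quoted-printable layer, then decode the bytes as UTF-8.
decoded = quopri.decodestring(raw).decode("utf-8")

print(repr(decoded))          # 'Buy\xa0now'
print("\u00a0" in decoded)    # True: the non-breaking space survives decoding
```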

I believe that simply changing

    ws_ptrn = re.compile(r'\s')

to

    ws_ptrn = re.compile(r'\s', flags=re.UNICODE)

would address this (and cover all the other Unicode space characters as well), but at the cost of breaking compatibility with signatures from older versions of pyzor, since any message containing such characters would then produce a different digest.
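To illustrate the difference, here is a rough sketch written for Python 3, where `str` patterns are Unicode-aware by default, so `re.ASCII` is used to imitate the current ASCII-only behaviour. The helpers `strip_ws` and `digest` are hypothetical stand-ins and not pyzor's actual digest code, and the SHA-1 step is only there to show the fingerprint changing:

```python
import hashlib
import re

# ASCII-only whitespace (stand-in for the current behaviour) vs. Unicode-aware.
ascii_ws = re.compile(r'\s', flags=re.ASCII)
unicode_ws = re.compile(r'\s', flags=re.UNICODE)


def strip_ws(pattern, text):
    """Remove every character the given whitespace pattern matches."""
    return pattern.sub('', text)


def digest(pattern, text):
    """Hypothetical fingerprint: SHA-1 of the whitespace-stripped text."""
    return hashlib.sha1(strip_ws(pattern, text).encode('utf-8')).hexdigest()


plain = "Buy cheap pills now"
nbsp = "Buy\u00a0cheap pills\u00a0now"  # same text, two spaces swapped for U+00A0

# ASCII-only matching leaves U+00A0 in place, so the fingerprints differ.
print(digest(ascii_ws, plain) == digest(ascii_ws, nbsp))      # False

# Unicode-aware matching strips U+00A0 too, so the fingerprints agree.
print(digest(unicode_ws, plain) == digest(unicode_ws, nbsp))  # True
```

The same sketch also shows the compatibility cost: for the message containing U+00A0, the ASCII-only and Unicode-aware digests differ from each other.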