kdeldycke/mail-deduplicate

Use Flanker to improve parsing of badly encoded mail

kdeldycke opened this issue · 3 comments

Detection and skipping of badly encoded mails has been added in #47. We can go further and improve parsing of these mails.

Flanker is a good effort to better parse email content in Python: https://github.com/mailgun/flanker . The idea would be to reuse its parsing utilities.

Unfortunately, Flanker doesn't target Python 3: mailgun/flanker#106 . Some efforts are being made on side branches and forks to make Flanker Python 3 ready: mailgun/flanker#106 (comment) . But these forks are unlikely to be reconciled. 😢

Digging deeper into Flanker's code, it seems that the latter relies on the chardet module to magically guess the encoding of badly encoded mails. This is strikingly similar to @kdmurray91's approach from #36.

So if we can't go the Flanker route, maybe we can reuse #36 code to enhance mail parsing.
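For illustration, a chardet-based fallback could look roughly like the sketch below. This is neither Flanker's code nor the code from #36: the helper names `decode_with_guess` and `body_text` and the `FALLBACK_ENCODINGS` list are hypothetical. It assumes we trust the declared charset first, then ask chardet for a guess if it is installed, then try a couple of common encodings.

```python
from email import message_from_bytes
from typing import Optional

try:
    import chardet  # third-party and optional; the sketch still runs without it
except ImportError:
    chardet = None

# Hypothetical last-resort candidates, tried in order.
FALLBACK_ENCODINGS = ("utf-8", "cp1252")


def decode_with_guess(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode bytes using the declared charset, then a guess, then fallbacks."""
    candidates = []
    if declared:
        candidates.append(declared)
    if chardet is not None:
        # chardet.detect() returns a dict whose "encoding" may be None.
        guess = chardet.detect(raw).get("encoding")
        if guess:
            candidates.append(guess)
    candidates.extend(FALLBACK_ENCODINGS)

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown charset name: try the next candidate.
            continue
    # Nothing decoded cleanly: keep what we can instead of skipping the mail.
    return raw.decode("utf-8", errors="replace")


def body_text(raw_mail: bytes) -> str:
    """Return the decoded body of a non-multipart mail."""
    msg = message_from_bytes(raw_mail)
    payload = msg.get_payload(decode=True) or b""
    return decode_with_guess(payload, msg.get_content_charset())
```

A guess can decode without error and still be wrong, so a result obtained this way is probably best treated as a best effort rather than as proof that the mail is well formed.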

Flanker is now Python 3 compatible. But we will close this issue for now while we wait for community feedback following the revival of this project.
