"To", "From" or "Subject" sometimes None while analyzing the enron E-Mails
Mr-Mistoffelees opened this issue · 1 comments
Hi all,
First of all, thank you for developing this great library, which saves me a lot of time.
I'm currently working with the Enron dataset. Usually, eml_parser works like a charm. But there are some mails where both eml_parser and Python's email library return None for "to", "from" or "subject", while Thunderbird displays the e-mail correctly (with from, to, subject, etc.).
I have to admit that many mails from the Enron Dataset are from 2001 or earlier.
Taking a random mail with three attachments:
Date: Tue, 20 Nov 2001 11:11:16 -0800 (PST)
Message-ID: <AECCD639E83D0540BA407A252A23E53D229F1F@NAHOU-MSMBX03V.corp.enron.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_186_25054702.1274862135875"
Microsoft Mail Internet Headers Version 2.0
X-MimeOLE: Produced By Microsoft Exchange V6.0.4712.0
content-class: urn:content-classes:message
Subject: FW: Things to do when the boss is out.
X-MS-Has-Attach: yes
X-MS-TNEF-Correlator: <AECCD639E83D0540BA407A252A23E53D229F1F@NAHOU-MSMBX03V.corp.enron.com>
Thread-Topic: FW: Things to do when the boss is out.
Thread-Index: AcFWZ093VuF/8cJaEdWxIgBQi+MJ2QALiYaAAB4BjBAACOWUMAACiWdwAAM4b4AAAjOG4AapitwQ
From: "Ring Andrea" <Andrea.Ring@ENRON.com>
To: "Brawner Sandra F." <Sandra.F.Brawner@ENRON.com>
X-ZL-From: Ring, Andrea </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ARING>
X-ZL-To: Brawner, Sandra F. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Sbrawne>
X-ZL-Subject: FW: Things to do when the boss is out.
X-Filename: SBRAWNE (Non-Privileged).pst
X-Folder: \Inbox
X-SDOC: 1385384
X-ZLID: zl-edrm-enron-v2-brawner-s-1359.eml
X-ZL-Date: Tue, 20 Nov 2001 13:11:16 -0600
------=_Part_186_25054702.1274862135875
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
-----Original Message-----
From: Winckowski, Michele
Sent: Wednesday, October 17, 2001 4:16 PM
Subject: FW: Things to do when the boss is out.
- cubicle hurdles.mpeg
- Hallway races.mpeg
- Rowing.mpeg
------=_Part_186_25054702.1274862135875
Content-Type: application/octet-stream; name="cubicle hurdles.mpeg"
Content-Transfer-Encoding: base64
Content-Disposition: ATTACHMENT; filename="cubicle hurdles.mpeg"
AAABuiEAAQAXgArdAAABuwAMgArdBOH/4OAMwMAgAAABvgfcD///////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
----------> LOTS OF BASE64 Code left out <----------
------=_Part_186_25054702.1274862135875--
To extract the sender and recipient, I use this code:
import pathlib

import eml_parser

sample = pathlib.Path('test/test3.eml')
# Read the raw message bytes.
with sample.open('rb') as fhdl:
    raw_email = fhdl.read()

ep = eml_parser.EmlParser(include_raw_body=True, include_attachment_data=False)
parsed_eml = ep.decode_email_bytes(raw_email)
print(parsed_eml['header'])
which leads to the following outcome:
{'subject': '', 'defect': [''], 'to': [], 'date': datetime.datetime(2001, 11, 20, 11, 11, 16, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=57600))), 'received': [], 'header': {'message-id': ['<AECCD639E83D0540BA407A252A23E53D229F1F@NAHOU-MSMBX03V.corp.enron.com>'], 'date': ['Tue, 20 Nov 2001 11:11:16 -0800'], 'mime-version': ['1.0'], 'content-type': ['multipart/mixed; boundary="----=_Part_186_25054702.1274862135875"']}}
The same happens with Python's built-in email library (where the values I'm looking for end up in the 'text' variable).
My code:
from email import policy
from email.parser import BytesParser

myfile = 'test/test3.eml'
# The with block closes the file, so no explicit close() is needed.
with open(myfile, 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)

# preferencelist expects a tuple; ('plain') without a comma is just the string 'plain'.
text = msg.get_body(preferencelist=('plain',)).get_content()

print('To:', msg['to'])
print('From:', msg['from'])
print('Subject:', msg['subject'])
print("\n===")
print(text)
Output:
To: None
From: None
Subject: None
===
-----Original Message-----
From: Winckowski, Michele
Sent: Wednesday, October 17, 2001 4:16 PM
Subject: FW: Things to do when the boss is out.
- cubicle hurdles.mpeg
- Hallway races.mpeg
- Rowing.mpeg
Any idea how I could solve this? (Plan B would be to run a regex on the text variable if everything else fails.)
Thanks, glad to hear you find it useful!
We had a similar report some time ago (#10) about problems parsing mails from that "enron" dataset.
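The issue is the line "Microsoft Mail Internet Headers Version 2.0", which is line 6 of the mail.
That line is not a valid header, so the parser stops reading headers there and treats everything after it, including Subject, From and To, as part of the body.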
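The behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, assuming a shortened, made-up message that only mimics the header layout of the sample (the `raw` bytes are illustrative, not taken from the dataset file). It shows that the headers after the stray line are missing and that the parser records a defect for the missing header/body separator:

```python
from email import policy
from email.errors import MissingHeaderBodySeparatorDefect
from email.parser import BytesParser

# Illustrative message: a stray non-header line sits in the middle of the headers.
raw = (
    b"Date: Tue, 20 Nov 2001 11:11:16 -0800 (PST)\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Microsoft Mail Internet Headers Version 2.0\r\n"
    b"Subject: FW: Things to do when the boss is out.\r\n"
    b"\r\n"
    b"Body text\r\n"
)

msg = BytesParser(policy=policy.default).parsebytes(raw)

# Only the headers before the stray line are recognised.
parsed_header_names = msg.keys()

# The parser notes that the header block ended without a blank separator line.
has_separator_defect = any(
    isinstance(defect, MissingHeaderBodySeparatorDefect) for defect in msg.defects
)

# Everything from the stray line onwards ends up in the body.
body = msg.get_content()

print(parsed_header_names)
print(has_separator_defect)
print(body)
```

Checking `msg.defects` this way may also help to find other affected mails in the dataset before deciding how to clean them.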
If you strip that line before feeding the content into eml_parser, you should be good.
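A minimal sketch of that workaround, assuming the stray line always appears verbatim on its own line inside the header block. The helper name `strip_bogus_header_line` and the shortened `sample` message are hypothetical and only for illustration. The result is checked here with the standard library parser so the example runs without extra dependencies:

```python
from email import policy
from email.parser import BytesParser

# Assumption: the stray line appears exactly like this, on a line of its own.
BOGUS_LINE = b"Microsoft Mail Internet Headers Version 2.0"


def strip_bogus_header_line(raw_email: bytes) -> bytes:
    """Drop the stray line from the header block and leave the body untouched."""
    cleaned = []
    in_headers = True
    for line in raw_email.splitlines(keepends=True):
        if in_headers:
            if line.strip() == b"":
                # First blank line: the header block ends here.
                in_headers = False
            elif line.strip() == BOGUS_LINE:
                continue
        cleaned.append(line)
    return b"".join(cleaned)


# Illustrative message that mimics the header layout of the sample.
sample = (
    b"Date: Tue, 20 Nov 2001 11:11:16 -0800 (PST)\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"Microsoft Mail Internet Headers Version 2.0\r\n"
    b"Subject: FW: Things to do when the boss is out.\r\n"
    b'From: "Ring Andrea" <Andrea.Ring@ENRON.com>\r\n'
    b'To: "Brawner Sandra F." <Sandra.F.Brawner@ENRON.com>\r\n'
    b"\r\n"
    b"Body text\r\n"
)

cleaned_email = strip_bogus_header_line(sample)
msg = BytesParser(policy=policy.default).parsebytes(cleaned_email)

print("To:", msg["to"])
print("From:", msg["from"])
print("Subject:", msg["subject"])
```

The cleaned bytes can be passed to `ep.decode_email_bytes(...)` in the same way as `raw_email` in the snippet from the question. Restricting the removal to the header block avoids touching a body that happens to quote the same text.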