GOVCERT-LU/eml_parser

"To", "From" or "Subject" sometimes None while analyzing the enron E-Mails

Mr-Mistoffelees opened this issue · 1 comment

Hi all,

First of all, thank you for developing this great library, which saves me a lot of time.

I'm currently working with the Enron dataset. Usually, eml_parser works like a charm. But for some mails, both eml_parser and Python's email library return None for "to", "from" or "subject", while Thunderbird displays the e-mail correctly (with from, to, subject, etc.).

I have to admit that many mails from the Enron Dataset are from 2001 or earlier.

Taking a random mail with three attachments:

Date: Tue, 20 Nov 2001 11:11:16 -0800 (PST)
Message-ID: <AECCD639E83D0540BA407A252A23E53D229F1F@NAHOU-MSMBX03V.corp.enron.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_186_25054702.1274862135875"
Microsoft Mail Internet Headers Version 2.0
X-MimeOLE:  Produced By Microsoft Exchange V6.0.4712.0
content-class:  urn:content-classes:message
Subject:  FW: Things to do when the boss is out.
X-MS-Has-Attach:  yes
X-MS-TNEF-Correlator:  <AECCD639E83D0540BA407A252A23E53D229F1F@NAHOU-MSMBX03V.corp.enron.com>
Thread-Topic:  FW: Things to do when the boss is out.
Thread-Index:  AcFWZ093VuF/8cJaEdWxIgBQi+MJ2QALiYaAAB4BjBAACOWUMAACiWdwAAM4b4AAAjOG4AapitwQ
From: "Ring  Andrea" <Andrea.Ring@ENRON.com>
To: "Brawner  Sandra F." <Sandra.F.Brawner@ENRON.com>
X-ZL-From:  Ring, Andrea </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ARING>
X-ZL-To:  Brawner, Sandra F. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Sbrawne>
X-ZL-Subject:  FW: Things to do when the boss is out.
X-Filename:  SBRAWNE (Non-Privileged).pst
X-Folder:  \Inbox
X-SDOC:  1385384
X-ZLID:  zl-edrm-enron-v2-brawner-s-1359.eml
X-ZL-Date:  Tue, 20 Nov 2001 13:11:16 -0600

------=_Part_186_25054702.1274862135875
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit



 -----Original Message-----
From: 	Winckowski, Michele  
Sent:	Wednesday, October 17, 2001 4:16 PM
Subject:	FW: Things to do when the boss is out.

  
 - cubicle hurdles.mpeg 
 - Hallway races.mpeg 
 - Rowing.mpeg 



------=_Part_186_25054702.1274862135875
Content-Type: application/octet-stream; name="cubicle hurdles.mpeg"
Content-Transfer-Encoding: base64
Content-Disposition: ATTACHMENT; filename="cubicle hurdles.mpeg"

AAABuiEAAQAXgArdAAABuwAMgArdBOH/4OAMwMAgAAABvgfcD///////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
----------> LOTS OF BASE64 Code left out <----------
------=_Part_186_25054702.1274862135875--

Trying to extract sender and recipient, I use this code...

import pathlib

import eml_parser

sample = pathlib.Path('test/test3.eml')

with sample.open('rb') as fhdl:
    raw_email = fhdl.read()

ep = eml_parser.EmlParser(include_raw_body=True, include_attachment_data=False)
parsed_eml = ep.decode_email_bytes(raw_email)

print(parsed_eml['header'])

which leads to the following outcome:

{'subject': '', 'defect': [''], 'to': [], 'date': datetime.datetime(2001, 11, 20, 11, 11, 16, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=57600))), 'received': [], 'header': {'message-id': ['<AECCD639E83D0540BA407A252A23E53D229F1F@NAHOU-MSMBX03V.corp.enron.com>'], 'date': ['Tue, 20 Nov 2001 11:11:16 -0800'], 'mime-version': ['1.0'], 'content-type': ['multipart/mixed; boundary="----=_Part_186_25054702.1274862135875"']}}

The same happens with Python's built-in email library (where the values I'm looking for end up in the 'text' variable).
My code:

from email import policy
from email.parser import BytesParser

myfile = 'test/test3.eml'
# The with block closes the file, so no explicit fp.close() is needed.
with open(myfile, 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)

# preferencelist expects a tuple, hence the trailing comma.
text = msg.get_body(preferencelist=('plain',)).get_content()
print('To:', msg['to'])
print('From:', msg['from'])
print('Subject:', msg['subject'])
print("\n===")
print(text)

Output:

To: None
From: None
Subject: None

===


 -----Original Message-----
From: 	Winckowski, Michele  
Sent:	Wednesday, October 17, 2001 4:16 PM
Subject:	FW: Things to do when the boss is out.

  
 - cubicle hurdles.mpeg 
 - Hallway races.mpeg 
 - Rowing.mpeg 

Any idea how I could solve this? (Plan B would be to run a regex on the text variable if everything else fails.)

Hi @Mr-Mistoffelees

Thanks, glad to hear you find it useful!

We had a similar report some time ago (#10) about problems parsing mails from that "enron" dataset.
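The issue is the "Microsoft Mail Internet Headers Version 2.0" line, which is line 6 of your sample mail.
That line is not a valid header (it has no "Name: value" form), so the parser stops reading headers there. Everything after it, including Subject, From and To, ends up in the body instead, which matches the output you posted.
If you strip that line before feeding the content into eml_parser, you should be good.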
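A minimal sketch of that workaround, in case it helps. The helper name `strip_ms_banner` and the shortened sample message are made up for illustration and are not part of eml_parser. The sketch only touches the header block (everything before the first blank line) and uses Python's standard email parser to show the effect.

```python
from email import policy
from email.parser import BytesParser

# The invalid line from the Enron sample; it is not a "Name: value" header.
BANNER = b"Microsoft Mail Internet Headers Version 2.0"


def strip_ms_banner(raw_email: bytes) -> bytes:
    """Hypothetical helper: drop the banner line from the header block only."""
    cleaned = []
    in_headers = True
    for line in raw_email.splitlines(keepends=True):
        if in_headers:
            if line.strip() == b"":
                in_headers = False  # first blank line ends the header block
            elif line.strip() == BANNER:
                continue  # skip the invalid line
        cleaned.append(line)
    return b"".join(cleaned)


# Shortened, made-up variant of the mail from the report.
sample = (
    b"Date: Tue, 20 Nov 2001 11:11:16 -0800 (PST)\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"Microsoft Mail Internet Headers Version 2.0\n"
    b"Subject:  FW: Things to do when the boss is out.\n"
    b'From: "Ring  Andrea" <Andrea.Ring@ENRON.com>\n'
    b'To: "Brawner  Sandra F." <Sandra.F.Brawner@ENRON.com>\n'
    b"\n"
    b"body text\n"
)

parser = BytesParser(policy=policy.default)
before = parser.parsebytes(sample)
after = parser.parsebytes(strip_ms_banner(sample))

print("before:", before["subject"], before["from"], before["to"])
print("after: ", after["subject"], after["from"], after["to"])
```

The cleaned bytes can then be passed to `ep.decode_email_bytes(...)` exactly as in the code from the report. Restricting the removal to the header block is a deliberate choice here, so that a mail quoting the same text in its body is left unchanged.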