papnkukn/eml-format

multi-part MIME not correctly handled

noktilux opened this issue · 5 comments

here is a message containing a small attached jpeg and a little message ("please see attached"). your parser is only picking up the attached jpeg and not getting the little message:

http://qstatistic.com/debug/sample_email.txt

i do not get the text with either the "read" or "parse" function.

i have had a look at the code and the issue is in this line:

if (lines[i - 1] == "" && line.indexOf("--" + findBoundary) == 0 && !/\-\-(\r?\n)?$/g.test(line)) {

the first bit -- looking for empty string -- is not valid in the test message i linked to above.

lines[i - 1] consists of "This is a multi-part message in MIME format."

i don't understand the logic of looking for the empty string -- why is finding the boundary string not enough here?

can somebody please say if this issue report has been seen?

hi2u commented

I was also having problems parsing over 50% of my emails due to some mime header content I think. Not sure if your issue is related, but here's what worked for me...

Didn't work:

emlformat.read(eml, { headersOnly: true }, function(error, data) {...});

The error I was getting was:

TypeError: Cannot read property 'length' of undefined
    at _read (node_modules/eml-format/lib/eml-format.js:466:39)
    at node_modules/eml-format/lib/eml-format.js:518:7
    at Object.emlformat.parse (node_modules/eml-format/lib/eml-format.js:554:5)
    at Object.emlformat.read (node_modules/eml-format/lib/eml-format.js:516:15)

Did work:

I just changed the headersOnly option to false and 100% of my emails were parsed...

emlformat.read(eml, { headersOnly: false }, function(error, data) {...});

@noktilux thanks for providing the example.

That is correct: the issue is the condition that strictly requires an empty line before the multi-part boundary marker:

if (lines[i - 1] == "" && line.indexOf("--" + findBoundary) == 0 && !/\-\-(\r?\n)?$/g.test(line)) {

Solved by removing the lines[i - 1] == "" ("previous line should be blank") condition from the if statement.
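
To illustrate the relaxed check, here is a minimal sketch of boundary detection without the blank-line requirement. It is not the library's actual code: `findPartStarts` and the sample lines are hypothetical names introduced only for this example, and the sketch covers just the delimiter test, not the rest of the parsing.

```javascript
// Hypothetical helper, NOT part of eml-format: returns the indexes of the
// lines that open a MIME part for the given boundary string.
function findPartStarts(lines, boundary) {
  const delimiter = "--" + boundary;
  const starts = [];
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    // A part delimiter is a line that starts with "--" + boundary.
    const isDelimiter = line.indexOf(delimiter) === 0;
    // The closing delimiter also ends with "--" and does not open a part.
    const isClosing = /--(\r?\n)?$/.test(line);
    // No check on lines[i - 1]: the preamble may sit directly above
    // the first delimiter, as in the sample message from this issue.
    if (isDelimiter && !isClosing) {
      starts.push(i);
    }
  }
  return starts;
}

// Sample modelled on the message in this issue (headers shortened).
const boundary = "------------194F0B6C07FF2414138ED9B2";
const sampleLines = [
  "This is a multi-part message in MIME format.",
  "--" + boundary,
  "Content-Type: text/plain; charset=utf-8; format=flowed",
  "",
  "please see attached",
  "",
  "--" + boundary,
  "Content-Type: image/jpeg;",
  "--" + boundary + "--"
];

console.log(findPartStarts(sampleLines, boundary));
```

With the old `lines[i - 1] == ""` condition, the delimiter directly below the preamble line would be skipped, which is why the text part went missing.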

The issue has been fixed in version 0.6.0.

Just to provide an example.

If the EML looks like this, i.e. with no blank line after "This is a multi-part message in MIME format.":

....
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="------------194F0B6C07FF2414138ED9B2"
Content-Language: en-US

This is a multi-part message in MIME format.
--------------194F0B6C07FF2414138ED9B2
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

please see attached


--------------194F0B6C07FF2414138ED9B2
Content-Type: image/jpeg;
 name="tired_boot.FJ010019.jpeg"
...

The eml-format should now read it as

{
  "date": "2018-04-29T18:05:09.000Z",
  ...
  "text": "please see attached\r\n\r\n",
  "attachments": [
    {
      "name": "tired_boot.FJ010019.jpeg",
      ...
    }
  ]
}

with the text property.