bbottema/outlook-message-parser

Bug: __nameid_ directory should not be parsed (and causing invalid HTML body)

derrohrbach opened this issue · 13 comments

As discussed in PR #22, see also nameid-fix branch:

In a MSG file the _nameid... directory represents named properties, which are as of right now not supported by this library. It tries to parse the entries in this directory using the normal property parser though, which creates conflicts and is wrong. You can also refer to the Microsoft documentation for this: [MS-OXMSG].

I had several emails where an id in this directory collided with the property id for HTML-Content (10130102), which caused RTF-Emails to incorrctly report this invalid content as the HTML body, which prevented the RTF Conversion from being read.

grafik

The fix in #22 could not be merged because of Tests failing! I can not reproduce this though, because my tests always fail with this error:

[ERROR] Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 0.07 s <<< FAILURE! - in org.simplejavamail.outlookmessageparser.HighoverEmailsTest
[ERROR] testUnicodeMessage(org.simplejavamail.outlookmessageparser.HighoverEmailsTest)  Time elapsed: 0.009 s  <<< ERROR!
java.lang.Error:
Unresolved compilation problems:
        OutlookMessageAssert cannot be resolved
        OutlookMessageAssert cannot be resolved
        OutlookMessageAssert cannot be resolved
        OutlookMessageAssert cannot be resolved
        The method normalizeText(String) is undefined for the type HighoverEmailsTest

        at org.simplejavamail.outlookmessageparser.HighoverEmailsTest.testUnicodeMessage(HighoverEmailsTest.java:520)

Ahh, those are classes generated by a maven plugin when you run mvn test

Well it seems for me those are not being generated, because i run using mvn test

Are there any news about these errors? I would really like to run the tests and fix the bugs ;)

I really don't have a clue why mvn test wouldn't work for you. Except maybe that you're JDK is too new? I'm developing this library with JDK 1.8 (and Maven 3.6.0).

Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: C:\Maven\bin\..
Java version: 1.8.0_222, vendor: AdoptOpenJDK, runtime: C:\Program Files\AdoptOpenJDK\jdk-8.0.222.10-hotspot\jre
Default locale: de_DE, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"

I got something running by copying the class files from target/test-classes to target/classes... Now trying to look into the tests

That's just bizar that you have to copy anything, I have no explanation for it.

Well a second run of the tests does not work either and fails with an error about something being already instrumented. My current procedure for testing:

  1. mvn clean test
  2. copy files
  3. mvn test

I was able to verify though that it's an error in the tests and not in the fixed code. It always fails at getBodyHTML().isNotEmpty(), because before you had the invalid body content in there. Now it is null, which is correct, because there is no HTML body in these four mails.

To verify this just look at the supposed HTML content which the current code produces for the following mails:

  • plain chain.msg
  • chinese message.msg
  • forward with attachments and embedded images.msg
  • nested simple mail.msg

You will see that the content is just garbled data. That is essentially the bug, which the patch fixed.
I will open a new pull request with the tests fixed.

Thanks for the research, much appreciated. Before I can merge the change, I would like to investigate why body HTML remains empty even if there is an RTF body. I'll be able to check that this weekend.

Well thats a different issue though, isn't it? The behaviour with this bugfix is better than before, because you can at least see, that there is no HTML Body, instead of getting invalid data. Also how would you detect if it is a real html email or if it was converted from rtf, if the html attribute is always set?

Indeed and I wasn't looking properly (it's been some time since I touched that part of the code): the converted HTML is there, but in msg.getConvertedBodyHTML() instead of msg.getBodyHTML(). so that's all hunky-dory. I'll prepare a release soon.

Released in 1.7.1