bbottema/outlook-message-parser

SimpleRTF2HTMLConverter removes some valid tags during conversion

fadeyev opened this issue · 10 comments

The following line in Outlook msg RTF file:
{\*\htmltag241 -->}
should turn into --> during conversion to HTML.
However it's replace with empty string by line 139 of SimpleRTF2HTMLConverter
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+\\}", "");
Thus resulting HTML file is broken and has invalid structure

This library is a pain to debug (I didn't build it from the ground up). I'll gladly take PR's if you have time for it :)

Meanwhile, I'll see if there are some quick wins from your bug reports (thanks for them btw!)

Regarding this specific case, I can easily include an exception for -->, but are there other types like this? Right now the regex generically (and aggressively) clean up the content.

//filtering embedded tags like {\*\htmltag64 <tr>}                                          }
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+(<.+>)\\}", "$1");
//filtering embedded tags like {\*\htmltag84 &#43;}
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+[^\\}<]+\\}", "");

Basically what happens is that any HTML that is not in tags is removed.

I don't quite understand what we are removing here. For example, in lines {\*\htmltag64 <tr>} and {\*\htmltag84 &#43;}, aren't <tr> and + sign parts of the original HTML email content?
I tried to just remove the second replacement completely and resulting HTMLs look fine to me, tried on several emails.

I don't quite understand what we are removing here.

Everything that is a HTML tag encoded as {htmltag <thetag>} is replaced by only <thetag>. Then the remaining encoded tags {htmltag whatever} are removed. You won't see the difference if the second type of encoded tags don't exist in your email. So this removes -->, but keeps for example <span>--></span>.

This is because RTF can be quite polluted with crap htmltags. Discerning which ones to keep is nigh impossible, except for specific exceptions like your - ->

Ok, I wrote a simple test - tried to parse 10 really complex emails with and without these two lines. Now I see what you mean - HTML entities are present twice in RTF, once as {htmltag &#43;} and then goes \htmlrtf +
Anyway, here's the summary of what got removed as a result of the test on 10 really complex files:

  1. <title> tag contents - not sure if it's a bug or not - I guess I'd like it to be kept
  2. part of CSS style definition looking like html, body {...} - seems like a bug as well
  3. Piece of HTML that was initially reported: -->
  4. Hundreds of &nbsp; - that's fine
  5. HTML entities like &#43;, &amp;, &gt;, etc. which duplicated by \htmlrtf - actually it's a bug as well, for &gt; it will produce invalid html

I suggest replacing two lines above with he following:
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+ (&[#\\w]+;)}\\\\htmlrtf.*\\\\htmlrtf0 ", "$1");
This works only for HTML entities and leaves out only HTML entity in place instead of the actual character.
I tested it on 10 complex emails - seems to work fine

Btw, the same code is in simple-java-mail library. Are you planning to make simple-java-mail library depend on this one and remove parsing classes from there?

P.S. both library are very cool, convenient and high-quality

I compared resulting HTML with the one produced by Outlook (when right clicking on an email and selecting View Source) - the two still have many small differences, many small bugs. My thinking that the class SimpleRTF2HTMLConverter needs to be completely rewritten and should use ANTLR. Probably I can do it in a month or so - too busy to do it right now.

Btw, the same code is in simple-java-mail library. Are you planning to make simple-java-mail library depend on this one and remove parsing classes from there?

Simple Java Mail already depends on this library.

My thinking that the class SimpleRTF2HTMLConverter needs to be completely rewritten and should use ANTLR.

Oh sure, I'm just happy to get a workable result at all at this point. Having an exact parser would be great though.

Long time ago, I searched for (lightweight) libraries dedicated to converting RTF to HTML, but at the time they didn't exist for Java. Maybe there's something out there now, though. I'll have a gander.

I suggest replacing two lines above with he following:
replacedText = replacedText.replaceAll("\\{\\\\\\*\\\\htmltag\\d+ (&[#\\w]+;)}\\\\htmlrtf.*\\\\htmlrtf0 ", "$1");
This works only for HTML entities and leaves out only HTML entity in place instead of the actual character.
I tested it on 10 complex emails - seems to work fine

Awesome, I'll have a look.

Released v1.3.0

Released v1.4.0, which now uses the new and improved spec-compliant RFC converter!