bbottema/outlook-message-parser

Wrong encoding for bodyHTML

Faelean opened this issue · 7 comments

If an email contains a bodyHTML (MAPI 0x1013) that is encoded in, for example, UTF-8, the parser ignores the encoding and uses CP1252, causing characters like ü to be displayed as Ã¼.

private String convertValueToString(final Object value) {
    if (value == null) {
        return null;
    }
    if (value instanceof String) {
        return (String) value;
    } else if (value instanceof byte[]) {
        return new String((byte[]) value, CharsetHelper.WINDOWS_CHARSET);
    } else {
        LOGGER.trace("Unexpected body class: {} (expected String or byte[])", value.getClass().getName());
        return value.toString();
    }
}
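
To illustrate the effect, here is a minimal, self-contained sketch of what happens when UTF-8 bytes are decoded as CP1252. The class name MojibakeDemo is made up for this example and is not part of the library, and it assumes that CharsetHelper.WINDOWS_CHARSET corresponds to windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of outlook-message-parser.
class MojibakeDemo {

    // Decodes the raw bytes with whatever charset the caller passes in.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "ü" (U+00FC) is the two bytes 0xC3 0xBC in UTF-8.
        byte[] utf8 = "\u00fc".getBytes(StandardCharsets.UTF_8);

        // Assumption: the parser's Windows charset is windows-1252.
        String wrong = decode(utf8, Charset.forName("windows-1252"));
        String right = decode(utf8, StandardCharsets.UTF_8);

        // Each byte becomes its own character: U+00C3 followed by U+00BC.
        System.out.println(wrong.equals("\u00c3\u00bc"));
        // Decoding with the matching charset restores the single character.
        System.out.println(right.equals("\u00fc"));
    }
}
```

The two bytes of the UTF-8 sequence are each mapped to a separate CP1252 character, which is why one umlaut turns into two unrelated characters.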

The problem is that the correct charset is not known when the String constructor is called. There might be a more efficient way to do this, but this is what we've come up with to replace line 259:

String convertedString = new String((byte[]) value, CharsetHelper.WINDOWS_CHARSET);
Pattern pattern = Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(convertedString);
if(m.find()) {
	try {
		convertedString = new String((byte[]) value, Charset.forName(m.group(2)));
	} catch (Exception e) {
		//ignore and use default charset
	}
}
return convertedString;

First step: convert everything as before.
Second step: check the resulting String for a charset declaration. The regex matches the following two patterns and extracts the charset:

<meta charset="utf-8" /> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

If the resulting String declares a charset, decode the bytes again using that charset; otherwise keep the String that was already created. The try/catch block guards the Charset.forName call in case someone messed up the charset in the bodyHTML.
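
For anyone who wants to try the two steps in isolation, here is a self-contained sketch of the same idea. The class name HtmlCharsetSniffer and the FALLBACK constant are invented for this example (FALLBACK stands in for CharsetHelper.WINDOWS_CHARSET, assumed to be windows-1252); this is not the code that ended up in the library.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of outlook-message-parser.
class HtmlCharsetSniffer {

    // Assumption: stands in for CharsetHelper.WINDOWS_CHARSET.
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    // Same regex as in the snippet above, compiled once instead of per call.
    private static final Pattern CHARSET_DECLARATION =
            Pattern.compile("charset=(\"|)([\\w\\-]+)\\1", Pattern.CASE_INSENSITIVE);

    static String decode(byte[] raw) {
        // Step 1: decode with the fallback. The meta tag is plain ASCII,
        // so it survives this pass even if the real charset differs.
        String firstPass = new String(raw, FALLBACK);

        // Step 2: look for a declared charset and decode again with it.
        Matcher m = CHARSET_DECLARATION.matcher(firstPass);
        if (m.find()) {
            try {
                return new String(raw, Charset.forName(m.group(2)));
            } catch (IllegalArgumentException e) {
                // Charset.forName rejects illegal or unsupported names;
                // fall through and keep the first-pass result.
            }
        }
        return firstPass;
    }
}
```

Catching IllegalArgumentException is enough here because both IllegalCharsetNameException and UnsupportedCharsetException extend it. Note that this approach only works when the declared charset is ASCII-compatible, since otherwise the first pass would not find the declaration.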

Do you have an .msg for me with an HTML body? I have been unable to produce one; all the emails I save with Outlook are converted to RTF format in the .msg files.

I'm sorry but I don't have any that I can share.
We also haven't been able to create .msg files that have these problems, but the ones provided to us contain private information so I'm not allowed to share them.

Maybe this .msg file was produced directly by an Exchange server? Or maybe by an older version of Outlook.

Until I get a sample I can't do anything on my end. I tried googling some public examples, but came up empty.

I've managed to get an example mail, but they do not want this mail to be public. I can send it to you privately, but you cannot upload it to your test resources in this git repository.
If you're OK with this I can mail it to you; otherwise I'd have to try and get another one.

Excellent, of course I agree to those terms. Thank you.

Fixed in 1.7.7.

Btw, I'm using UTF-8 as the default now rather than the Windows encoding. The detection logic is still very useful for more exotic encodings, such as some Chinese character sets.