PreviewText field in MessageSummary is incorrect for the email with charset="ks_c_5601-1987"

Question

PreviewText field in MessageSummary is incorrect for the email with charset="ks_c_5601-1987"

aldayneko opened this issue 7 months ago · 14 comments

IMailFolder.FetchAsync returns incorrect PreviewText in case when email has Content-Type: text/plain; charset="ks_c_5601-1987"
The result is "� � � �ȳ�ϼ� �, � �⸦ �ٶ�ϴ�. �̹� �ָ� � �ȹ� � �ֳ�? �ٵ� � �" but should be like this "안녕하세요 여러분 내 주제는 여기 안녕하세요 여러분"
This charset is registered on the environment and I can read the correct body when fetch body parts, but can't get the correct value inside PreviewText.
I can see in the code that finally PreviewText value is read like UTF8 but what to do if there is another encoding. Is there way to customize it ?

public async Task<string> ReadLiteralAsync (CancellationToken cancellationToken)
{
	if (Stream.Mode != ImapStreamMode.Literal)
		throw new InvalidOperationException ();

	int literalLength = Stream.LiteralLength;
	var buf = ArrayPool<byte>.Shared.Rent (literalLength);

	try {
		int n, nread = 0;

		do {
			if ((n = await Stream.ReadAsync (buf, nread, literalLength - nread, cancellationToken).ConfigureAwait (false)) > 0)
				nread += n;
		} while (nread < literalLength);

		try {
			return TextEncodings.UTF8.GetString (buf, 0, nread);
		} catch {
			return TextEncodings.Latin1.GetString (buf, 0, nread);
		}
	} finally {
		ArrayPool<byte>.Shared.Return (buf);
	}
}

Platform (please complete the following information):

MailKit Version: 4.6

Exception
NO

To Reproduce
Call FetchAsync and check the result when message has charset="ks_c_5601-1987"

Expected behavior
PreviewText should have readable value

Answer 1 · 2024-05-20T19:59:57.000Z

Could you construct a sample message with such a message body for me? This way I could add it to my unit tests, debug it, and figure out a proper solution?

Thanks!

Answer 2 · 2024-05-20T20:44:34.000Z

Sure, here is the text file with message source. Please, let me know if it's what you need
korean.txt
Thanks.

Answer 3 · 2024-05-21T15:51:22.000Z

Quick question:

I started to take a look at this and there are 2 implementations for preview text.

The server supports the PREVIEW feature.
The server does not support the PREVIEW feature, so MailKit attempts to fetch a small chunk of the message body itself.

Option 2 is by far the most prevalent because very few servers actually support the PREVIEW feature. That said, based on your original bug report, it sounds like you've already done some debugging and have narrowed the bug down to the ReadLiteral method doing incorrect character conversion.

Does that mean your server supports the PREVIEW feature?

I just want to make sure I'm looking into the correct bits of code.

Answer 4 · 2024-05-21T17:59:31.000Z

Yeah, I have been trying to debug it. Server doesn't support PREVIEW feature, so I see that previewText value finally is read as a small chunk of the body and decoded inside ReadLiteralAsync as UTF8, which is incorrect since it's not UTF8. It seems like we need somehow to put here the correct encoding from the body charset

Answer 5 · 2024-05-21T18:18:23.000Z

I would have assumed that https://github.com/jstedfast/MailKit/blob/master/MailKit/Net/Imap/ImapFolderFetch.cs#L918 would have taken care of this because QueueFetchPreviewCommand sets up a FetchPreviewContext which gets the raw body data added to it via the FetchStreamAsync callback method which consumes the body content into a Stream and then calls ctx.AddAsync() which eventually makes it to FetchPreviewTextContext.Add()

Perhaps what is happening is that there is a DecoderException for ks_c_5601-1987 (due to incomplete data that ends in the middle of a multibyte character?) which then causes fallback to iso-8859-1?

Answer 6 · 2024-05-21T18:43:35.000Z

Take a look at the above commit - it attempts to reproduce the issue in the unit tests but it works fine for me.

That said, I suspect that it's working for me because the unit test is likely missing something important but I don't know what it is.

As per my previous comment, perhaps the issue is that the unit test has a shorter message body than the message body that is causing you problems and if the message body were longer, then it might hit a charset conversion issue that results in fallback to iso-8859-1.

Do you think it would be possible for you to take a look at the above unit test that I just added use that as a template for obtaining an IMAP server response that causes this issue?

Answer 7 · 2024-05-23T14:05:50.000Z

I've scraped some Korean text off of Wikipedia to try and get a long string of text and it still works, so my thinking is that perhaps you did not register the international text encodings in your app?

Make sure at program startup, you call the following line of code:

System.Text.Encoding.RegisterProvider (System.Text.CodePagesEncodingProvider.Instance);

Answer 8 · 2024-05-23T15:27:57.000Z

Sorry, I didn't have time to try your recommendations above, I'll do it tomorrow or during the weekend

Regarding encodings - it's registered exactly as you wrote and I'm able to read body and subject in correct format when I call MailFolder.GetMessageAsync()

Answer 9 · 2024-05-24T16:23:35.000Z

I've tried to copy-paste responses from the server into unit test that you mentioned, but I'm not sure that I did it correctly.
I'm getting some exceptions about parsing string, probably because I copied wrong data.
So, I caught all raw logs via ProtocolLogger and wrote to the file
Could you please check it with this data ? It exactly what I get from the imap server
imap_log.txt

Answer 10 · 2024-05-24T21:23:15.000Z

Okay, so this is what we're interested in:

C: D00000007 UID FETCH 284 (UID FLAGS INTERNALDATE ENVELOPE BODYSTRUCTURE PREVIEW BODY.PEEK[HEADER.FIELDS (IMPORTANCE SECURE-REPLY-SENDER)])
S: * 144 FETCH (UID 284 FLAGS (\Seen) INTERNALDATE "17-May-2024 07:48:43 +0000" PREVIEW {213}
S: � � � �ȳ�ϼ� �, � �⸦ �ٶ�ϴ�. �̹� �ָ� � �ȹ� � �ֳ�? �ٵ� � �ñ� �ٶ�Կ�! � �ȹ�ϴ� �̵� �ֽ�ϴ�. � � �غ�. � �! ENVELOPE [snip] BODY[HEADER.FIELDS (IMPORTANCE SECURE-REPLY-SENDER)] {2}
S: 
S: )
S: D00000007 OK Fetch completed (0.001 + 0.000 secs).

Unfortunately, the IMAP server does claim to support the PREVIEW feature and it is providing the preview text in the wrong charset. It MUST provide the preview text in UTF-8, but it is providing it in ks_c_5601-1987 as you've noted.

You could try submitting a bug here: https://github.com/dovecot/core

The specification in question can be found here: https://www.rfc-editor.org/rfc/rfc8970.html
Specifically, section 3.3 notes:

The generated string MUST NOT be content transfer encoded and MUST be
encoded in UTF-8 [RFC3629].  The server SHOULD remove any formatting
markup and do whatever processing might be useful in rendering the
preview as plain text.

If you file a bug against Dovecot, feel free to /cc me so that I can follow along and chime in if needed.

As a work-around, you could try disabling server-side PREVIEW by doing this in your code after authenticating:

client.Capabilities &= ~ImapCapabilities.Preview;

This will unfortunately make the query a bit slower, but it should provide better results until Dovecot fixes the bug.

Hope that helps.

Answer 11 · 2024-05-27T11:59:45.000Z

Yes, you're right, the IMAP server supports preview.
Sorry, when I was debugging here https://github.com/jstedfast/MailKit/blob/master/MailKit/Net/Imap/ImapFolderFetch.cs#L1273 the previewText variable was false that's why I thought preview is not supported.
Btw, regarding rfc, the link that you've provided has a status 'Proposed Standard' - so, it's not final and we can't say it's a bug ( or we can ? :))
Anyway, we're trying to fix it with dovecat configuration. Disabling server-side PREVIEW is an option, but not preferable
Just a question - why we can't read response from preview base on body charset instead of UTF-8
Or it'll be breaking change because in most cases it's returned as UTF-8 ?

Answer 12 · 2024-05-27T22:05:58.000Z

the previewText variable was false that's why I thought preview is not supported.

Don't worry about it, I can understand why you misunderstood.

Btw, regarding rfc, the link that you've provided has a status 'Proposed Standard' - so, it's not final and we can't say it's a bug ( or we can ? :))

We can say it's a bug because clients and servers are supposed to implement specifications exactly or they aren't worth writing.

Anyway, we're trying to fix it with dovecat configuration. Disabling server-side PREVIEW is an option, but not preferable

You don't have to disable it in Dovecot in a config file, just disable it in MailKit by using the snippet I pasted in my previous comment.

Just a question - why we can't read response from preview base on body charset instead of UTF-8
Or it'll be breaking change because in most cases it's returned as UTF-8 ?

Because we don't necessarily know the body charset.

Answer 13 · 2024-05-28T07:49:07.000Z

if we don't know anything about body charset, how do we read Korean text in the subject ? I thought, in most cases it should be the same.

Answer 14 · 2024-05-28T13:45:19.000Z

Every header (or even every email address in a To/From/Cc header) and every MIME part of the message can use a different charset.

You can't rely on a subject charset to be the same as the body.