jgm/pandoc

Converting from RTF (containing CJK characters) to Markdown causes CJK characters to be messed up

kenjiuno opened this issue · 10 comments

Explain the problem.

Converting from RTF (containing Chinese, Japanese, and Korean characters) to Markdown causes the CJK characters to be messed up.

Hello! English and CJK.rtf (input)

{\rtf1\ansi\ansicpg932\deff0\nouicompat\deflang1033\deflangfe1041{\fonttbl{\f0\fnil\fcharset128 Arial Unicode MS;}{\f1\fnil\fcharset129 Arial Unicode MS;}}
{\*\generator Riched20 10.0.19041}\viewkind4\uc1 
\pard\sa200\sl276\slmult1\f0\fs22\lang17 Hello! English and CJK\par
\u20320?\'8d\'44\'81\'49\par
\lang1041\'82\'b1\'82\'f1\'82\'c9\'82\'bf\'82\'cd\'81\'49\par
\f1\'be\'c8\'b3\'e7\'c7\'cf\'bc\'bc\'bf\'e4\f0\lang1033 !\lang17\par
}
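
As a quick sanity check (plain Python, not pandoc): the \'xx escapes in this sample are raw bytes in Windows code pages, and decoding each run with the right code page recovers the expected text.

```python
# The Japanese run: \'82\'b1\'82\'f1\'82\'c9\'82\'bf\'82\'cd\'81\'49
jp = bytes.fromhex("82b182f182c982bf82cd8149")
print(jp.decode("cp932"))   # こんにちは！

# The Korean run: \'be\'c8\'b3\'e7\'c7\'cf\'bc\'bc\'bf\'e4
ko = bytes.fromhex("bec8b3e7c7cfbcbcbfe4")
print(ko.decode("cp949"))   # 안녕하세요

# The second Chinese character, written as \'8d\'44 (CP932 0x8D44)
print(bytes.fromhex("8d44").decode("cp932"))   # 好
```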
 

Open this input with Windows Wordpad (write.exe "Hello! English and CJK.rtf").

(screenshot: the RTF rendered correctly in WordPad)

Command:

pandoc -o "Hello! English and CJK.md" "Hello! English and CJK.rtf"

Hello! English and CJK.md (output)

Actual

Hello! English and CJK

你D�I

‚±‚ñ‚É‚¿‚Í�I

¾È³çÇϼ¼¿ä!

Expected

Hello! English and CJK

你好!

こんにちは!

안녕하세요!

Pandoc version?

Pandoc is 3.1.13, installed with pandoc-3.1.13-windows-x86_64.msi.

pandoc 3.1.13
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: C:\Users\KU\AppData\Roaming\pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

Using Windows 10 Pro, Japanese edition.

Microsoft Windows [Version 10.0.19045.4291]
jgm commented

We don't support \ansicpg and I think that may be the issue here.
Your document specifies code page 932, which is
https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)
for Japanese.

To improve pandoc, we'd need to make it sensitive to \ansicpg and implement lookup tables for the various ANSI code pages.

jgm commented

Here is a code page 932 to unicode conversion:
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
But note: sequences like \'82\'B1 map to 0x82B1, a two-byte sequence. This will require changes in how we handle things.
Currently we process tokens, including hex escapes, one by one. We'll either need to add some state or change the function so it can gobble multiple tokens.
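
The multi-token approach could be sketched like this (an illustrative Python sketch of the idea, not pandoc's Haskell code): collect consecutive \'xx tokens into one byte string, then decode the whole run at once so double-byte sequences stay intact.

```python
import re

def decode_hex_runs(tokens, codepage="cp932"):
    """Gobble consecutive \\'xx hex-escape tokens into one byte string and
    decode them together, so multi-byte characters survive.
    Illustrative sketch only; token shape is an assumption."""
    out, buf = [], bytearray()
    for tok in tokens:
        m = re.fullmatch(r"\\'([0-9a-fA-F]{2})", tok)
        if m:
            buf.append(int(m.group(1), 16))   # accumulate the run
        else:
            if buf:                           # run ended: decode it whole
                out.append(buf.decode(codepage))
                buf = bytearray()
            out.append(tok)
    if buf:
        out.append(buf.decode(codepage))
    return "".join(out)

print(decode_hex_runs(["\\'82", "\\'b1", "\\'82", "\\'f1"]))  # こん
```

Decoding each \'xx token in isolation (the current one-token-at-a-time approach) would split 0x82B1 into two meaningless bytes, which is exactly the garbling seen above.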

jgm commented

I've added some infrastructure to handle this, but the final piece will require an 8000 line lookup table for cp932. This probably needs to be implemented as a separate library.

jgm commented

Aha, there is https://hackage.haskell.org/package/encoding-0.8.1
But it might be nice to avoid the transitive dependencies. We could, perhaps, just copy some code from the relevant module.

kenjiuno commented

Hi, thanks for investigating this.

There is a big pitfall in this RTF.

This RTF mixes two code pages in one file: CP932 (a Japanese legacy charset, aka Shift_JIS) and CP949 (a Korean legacy charset).
\ansicpg932 declares only one of them.

Meanwhile, the two Chinese characters are represented in two different ways:

  1. as \u20320? (a Unicode escape; 你 is not available in CP932)
  2. as \'8d\'44 (the CP932 double-byte character 0x8D44, i.e. 好)
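
For the first form, \uN is a decimal Unicode escape followed by fallback text (here ?) that readers honoring \ucN should skip. A minimal decoding sketch (illustrative Python; the signed-16-bit wraparound is part of the RTF spec):

```python
def decode_unicode_escape(n):
    """Decode an RTF \\uN escape. RTF writes N as a signed 16-bit decimal,
    so code points above 0x7FFF appear as negative numbers and must be
    wrapped around. Illustrative sketch only."""
    if n < 0:
        n += 65536
    return chr(n)

print(decode_unicode_escape(20320))  # 你 (the character \u20320? encodes)
```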

Perhaps \fcharset128 and \fcharset129 indicate the difference.

(screenshot: the \fcharset value table from the Rich Text Format article on Wikipedia)


Although I'm not very familiar with the RTF specification, I found the relevant descriptions in the documentation.

[MS-OXRTFCP]: Informative References | Microsoft Learn

Page 14 describes \ansicpgN:

(screenshot of the \ansicpgN excerpt)

Page 20 describes \fcharsetN:

(screenshot of the \fcharsetN excerpt)


If this entire thing is too complex to handle, IMO it would be OK to display a warning about the use of non-Unicode characters in the RTF file instead. (The user would then need to convert all of the non-Unicode characters into Unicode with some other tool that can handle them.)

jgm commented

What a format! Apparently we need to pay attention both to \ansicpgN (which specifies the code page) and to \fcharsetN (which specifies the character set of a font in the font table). I understand the former but not really the latter. It seems that \fcharset128 is Shift JIS and 129 is Hangul. Does that mean it will work if I decode using code page 949 (Korean) for \fcharset129? I'm unclear on the relation between \fcharset and \ansicpg. (EDIT: Sorry, I see that you answered this question with your last excerpt from the manual.)

Some helpful info here:
https://latex2rtf.sourceforge.net/rtfspec_6.html

Note: apparently CP932 is an extension of Shift JIS.
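
For reference, a partial \fcharsetN → Windows code page table (the values are the well-known ones from the RTF spec; this dict is only a sketch, not pandoc code):

```python
# Partial mapping from RTF \fcharsetN values to Windows code pages.
FCHARSET_TO_CODEPAGE = {
    0: 1252,    # ANSI
    128: 932,   # Shift JIS (Japanese)
    129: 949,   # Hangul (Korean)
    134: 936,   # GB2312 (Simplified Chinese)
    136: 950,   # Big5 (Traditional Chinese)
    204: 1251,  # Cyrillic
}

def codepage_for_font(fcharset, default_cp=1252):
    # Fall back to Windows-1252 for charsets not in the table.
    return FCHARSET_TO_CODEPAGE.get(fcharset, default_cp)

print(codepage_for_font(128))  # 932
```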

jgm commented

So, I've figured out how to do the decoding.
What we need now is to implement language shifting with \langN and \deflangN. Language shifts will need to induce a shift in the active code page, according to the table you've given above.

jgm commented

There's a list of standard language codes here
https://www.oreilly.com/library/view/rtf-pocket-guide/9781449302047/ch04.html
Maybe we can get by with those.

jgm commented

\deflang is for the default language.
\deflangfe is for the default East Asian language.
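
Putting the pieces together, the language-driven code-page shifting described above might look like this (a hypothetical Python sketch; the LCID values are standard Windows language IDs, but the state-machine shape is an assumption, not pandoc's implementation):

```python
# Standard Windows language IDs (\langN / \deflangN values) -> code page.
LANG_TO_CODEPAGE = {
    1033: 1252,  # English (US)
    1041: 932,   # Japanese (Shift JIS)
    1042: 949,   # Korean
    2052: 936,   # Chinese (PRC)
    1028: 950,   # Chinese (Taiwan)
}

class CodePageState:
    def __init__(self, deflang=1033, deflangfe=None):
        # \deflangN / \deflangfeN set document-wide defaults.
        self.lang = deflang
        self.deflangfe = deflangfe

    def shift(self, control, n):
        # A \langN control word shifts the language of the following run.
        if control == "lang":
            self.lang = n

    def active_codepage(self):
        # Prefer the current language's code page; fall back to the
        # default East Asian language, then to Windows-1252.
        return (LANG_TO_CODEPAGE.get(self.lang)
                or LANG_TO_CODEPAGE.get(self.deflangfe)
                or 1252)
```

With this sketch, the sample document's \lang1041 run would be decoded as CP932 and a \lang1042 run as CP949; the remaining gap is that the Korean run in the sample is marked only by \f1 (whose font has \fcharset129), not by a \langN shift.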

jgm commented

OK, this is not going to work. The encoding library doesn't even have the Korean code page. And it takes 7 minutes to compile and adds some new dependencies. I'm taking your suggestion and just emitting a warning:

[WARNING] Unsupported code page 932. Text will likely be garbled.
Hello! English and CJK

你D�I

‚±‚ñ‚É‚¿‚Í�I

¾È³çÇϼ¼¿ä!