Converting from RTF (using CJK characters) into Markdown causes CJK characters to be messed up
kenjiuno opened this issue · 10 comments
Explain the problem.
Converting from RTF (using characters of Chinese, Japanese, and Korean languages) into Markdown causes CJK characters to be messed up.
Hello! English and CJK.rtf
(input)
{\rtf1\ansi\ansicpg932\deff0\nouicompat\deflang1033\deflangfe1041{\fonttbl{\f0\fnil\fcharset128 Arial Unicode MS;}{\f1\fnil\fcharset129 Arial Unicode MS;}}
{\*\generator Riched20 10.0.19041}\viewkind4\uc1
\pard\sa200\sl276\slmult1\f0\fs22\lang17 Hello! English and CJK\par
\u20320?\'8d\'44\'81\'49\par
\lang1041\'82\'b1\'82\'f1\'82\'c9\'82\'bf\'82\'cd\'81\'49\par
\f1\'be\'c8\'b3\'e7\'c7\'cf\'bc\'bc\'bf\'e4\f0\lang1033 !\lang17\par
}
Open this input with Windows WordPad (`write.exe "Hello! English and CJK.rtf"`).
Command:
pandoc -o "Hello! English and CJK.md" "Hello! English and CJK.rtf"
Hello! English and CJK.md
(output)
Actual
Hello! English and CJK
你D�I
‚±‚ñ‚É‚¿‚Í�I
¾È³çÇϼ¼¿ä!
Expected
Hello! English and CJK
你好!
こんにちは!
안녕하세요!
Pandoc version?
Pandoc 3.1.13, installed with `pandoc-3.1.13-windows-x86_64.msi`.
pandoc 3.1.13
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: C:\Users\KU\AppData\Roaming\pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
Using Windows 10 Pro, Japanese edition.
Microsoft Windows [Version 10.0.19045.4291]
We don't support `\ansicpg`, and I think that may be the issue here.
Your document specifies code page 932, which is
https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)
for Japanese.
To improve pandoc, we'd need to make it sensitive to `\ansicpg` and implement lookup tables for the various ANSI code pages.
Here is a code page 932 to unicode conversion:
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
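As a sketch of how such a lookup table could be generated (my illustration, not pandoc code): the CP932.TXT file is tab-separated, with the Shift JIS byte value in column 1 and the Unicode code point in column 2, plus `#` comments. The `sample` lines below are illustrative entries in that format:

```python
# Build a CP932 -> Unicode lookup table from the unicode.org mapping format.
# Each data line is "0xXX(XX)\t0xYYYY\t# NAME"; '#' starts a comment.
def parse_mapping(text):
    table = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # unmapped entries list only the byte value
        byte_val, code_point = (int(f, 16) for f in fields[:2])
        table[byte_val] = chr(code_point)
    return table

# Illustrative sample lines in the CP932.TXT format:
sample = """
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0x82B1\t0x3053\t# HIRAGANA LETTER KO
0x8D44\t0x597D\t# CJK UNIFIED IDEOGRAPH
"""
table = parse_mapping(sample)
print(table[0x82B1])  # こ
```

Cross-checking against Python's built-in `cp932` codec is a cheap way to validate such a generated table.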
But note! Sequences like `\'82\'B1` map to 0x82B1, a two-byte sequence. This will require changes in how we handle things. Currently we process tokens, including hex escapes, one by one. We'll either need to add some state or change the function so it can gobble multiple tokens.
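To illustrate the two-token issue (a sketch using Python's built-in `cp932` codec, not pandoc's actual token handling): a lead byte like 0x82 is not decodable on its own, so the reader has to gobble the next hex escape and decode the pair as one unit:

```python
# CP932 lead bytes like 0x82 only make sense paired with the following
# byte, so \'XX hex escapes must be consumed two at a time.
lead = bytes([0x82])
try:
    lead.decode('cp932')          # a lone lead byte is invalid
except UnicodeDecodeError:
    print('0x82 alone: undecodable')

pair = bytes([0x82, 0xB1])        # the \'82\'b1 sequence from the RTF
print(pair.decode('cp932'))       # こ (HIRAGANA LETTER KO)
```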
I've added some infrastructure to handle this, but the final piece will require an 8000-line lookup table for cp932. This probably needs to be implemented as a separate library.
Aha, there is https://hackage.haskell.org/package/encoding-0.8.1
But it might be nice to avoid the transitive dependencies. We could, perhaps, just copy some code from the relevant module.
Hi, thanks for investigating this.
There is a big pitfall in this RTF: it mixes 2 code pages, CP932 (a Japanese local charset, aka Shift_JIS) and CP949 (a Korean local charset), in one file. `\ansicpg932` tells us about only one of them.
Meanwhile, the 2 Chinese characters are represented as:

- 你 as `\u20320?` (not available in `\ansicpg932`)
- 好 as `\'8d\'44` (the CP932 character 0x8D44)

Perhaps `\fcharset128` and `\fcharset129` may indicate the difference.
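The two representations can be checked mechanically (a Python sketch; the codec name is Python's, but the mappings are the standard Windows ones):

```python
# 你 (U+4F60) has no CP932 encoding, so the RTF uses a \uN? escape with
# the decimal code point; 好 (U+597D) is in CP932, so it appears as the
# hex-escape pair \'8d\'44.
print(chr(20320))                           # 你 from \u20320?
print(bytes([0x8D, 0x44]).decode('cp932'))  # 好 from \'8d\'44

# Confirm why 你 needed the Unicode escape: it is not in CP932 at all.
try:
    '你'.encode('cp932')
except UnicodeEncodeError:
    print('你 is not representable in CP932')
```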
Although I don't know the RTF file specification very well, I have found the relevant descriptions in the documentation:
[MS-OXRTFCP]: Informative References | Microsoft Learn

- Page 14 describes `\ansicpgN`
- Page 20 describes `\fcharsetN`
If this entire thing is too complex to handle, IMO it would be OK to display a warning about the usage of non-Unicode characters in the RTF file instead. (That means the user would need to convert all of the non-Unicode characters into Unicode characters beforehand, with whatever tool can handle it.)
What a format! Apparently we need to pay attention both to `\ansicpgN` (which specifies a code page) and to `\fcharsetN` (which specifies the character set of a font in the font table). I understand the former but not really the latter. It seems that `\fcharset128` is Shift JIS and 129 is Hangul. Does that mean it will work if I decode using code page 949 (Korean) for `\fcharset129`? I'm unclear on the relation between fcharset and ansicpg. (EDIT: Sorry, I see that you answered this question with your last excerpt from the manual.)
Some helpful info here:
https://latex2rtf.sourceforge.net/rtfspec_6.html
Note: apparently CP932 is an extension of Shift JIS.
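For what it's worth, decoding the `\f1` (fcharset129) run with code page 949 does produce the expected Korean text. A quick check with Python's `cp949` codec (an illustration, not pandoc code):

```python
# The bytes from the \'be\'c8\'b3\'e7\'c7\'cf\'bc\'bc\'bf\'e4 run
# in the sample RTF, decoded with the Korean code page.
korean = bytes([0xBE, 0xC8, 0xB3, 0xE7, 0xC7, 0xCF, 0xBC, 0xBC, 0xBF, 0xE4])
print(korean.decode('cp949'))  # 안녕하세요
```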
So, I've figured out how to do the decoding. What we need now is to implement language shifting with `\langN` and `\deflangN`. Language shifts will need to induce a shift in the active code page, according to the table you've given above.
There's a list of standard language codes here:
https://www.oreilly.com/library/view/rtf-pocket-guide/9781449302047/ch04.html
Maybe we can get by with those.
`\deflangN` is for the default language; `\deflangfeN` is for the default East Asian language.
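A sketch of the language-shift idea (the LCID values 1033/1041/1042 are the standard Windows language IDs for en-US/ja-JP/ko-KR; the tiny table is illustrative, covering only the languages in the sample file):

```python
# Minimal mapping from RTF \langN language IDs to ANSI code pages.
LANG_TO_CODEPAGE = {
    1033: 'cp1252',  # English (US) -> Western European
    1041: 'cp932',   # Japanese    -> Shift JIS superset
    1042: 'cp949',   # Korean      -> Unified Hangul Code
}

def decode_run(lang, raw):
    """Decode a text run's bytes using the active language's code page."""
    return raw.decode(LANG_TO_CODEPAGE.get(lang, 'cp1252'))

print(decode_run(1041, bytes([0x82, 0xB1, 0x82, 0xF1])))  # こん
print(decode_run(1042, bytes([0xBE, 0xC8, 0xB3, 0xE7])))  # 안녕
```

A real implementation would also have to weigh `\fcharsetN` from the font table, since (as in the sample) a run's font can imply a code page that the surrounding `\langN` does not.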
OK, this is not going to work. The encoding library doesn't even have the Korean code page. And it takes 7 minutes to compile and adds some new dependencies. I'm taking your suggestion and just emitting a warning:
[WARNING] Unsupported code page 932. Text will likely be garbled.
Hello! English and CJK
你D�I
‚±‚ñ‚É‚¿‚Í�I
¾È³çÇϼ¼¿ä!