discussion: rewrite the package to meet w3c typography requirements and complete character coverage
JLHwung opened this issue · 2 comments
This is a follow-up to #47.
Properties Coverage instead of Blocks Coverage
I don't think it is sustainable to maintain a long list of Unicode blocks (although, unfortunately, most isCJK js utilities do exactly that, as far as I know). We should take a step forward and make use of the properties defined in the UCD and maintained by Unicode experts. For example, we can choose all encoded characters satisfying the following constraints:
Script=Han
General_Category=Other_Letter|Letter_Number|Other_Symbol
The semantics of General Category are defined here. By doing so we can abstract away from the concrete Unicode blocks and work with character properties instead.
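As a rough sketch of what "properties instead of blocks" could look like, the two constraints above can be written with Unicode property escapes in a regex that uses the unicode flag. The helper name `isHanLetter` is hypothetical and not part of this package; the lookahead is just one way to intersect the Script and General_Category properties.

```javascript
// Intersection of Script=Han with the three General_Category values listed above.
// The lookahead checks the script; the character class checks the category.
// Lo = Other_Letter, Nl = Letter_Number, So = Other_Symbol.
const hanLetter = /(?=\p{Script=Han})[\p{Lo}\p{Nl}\p{So}]/u;

// Hypothetical helper, for illustration only.
function isHanLetter(char) {
  return hanLetter.test(char);
}

console.log(isHanLetter("中")); // true: Script=Han, Other_Letter
console.log(isHanLetter("𠀀")); // true: U+20000, a SIP character, needs no extra block entry
console.log(isHanLetter("a")); // false: Script=Latin
console.log(isHanLetter("、")); // false: General_Category is punctuation
```

No block list appears anywhere in the pattern, so newly encoded Han characters are picked up whenever the engine's Unicode data is updated.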
Terminology in accordance with Unicode
Characters: An association between an abstract character and a code point (D11 Encoded Characters defined here)
Punctuations: Any character with General_Category = Punctuation
Our definitions for our specific purpose
cjk-punctuation: (Some list of blocks to be discussed, I mostly agree with current blocks except for Hangul Syllables)
Letter: Any character with General_Category = Other_Letter | Letter_Number | Other_Symbol
cjk-letter: The Letter with Script=Han, Katakana, Hiragana, Hangul.
Other: Any character that is neither cjk-punctuation nor cjk-letter.
Compliance with w3c typography requirements
As far as I can tell from npm, prettier is the only dependent of this new project, so we can rethink the use case of this package.
According to prettier/prettier#3026, the requirements of printer-markdown can be rephrased in the new terminology as:
- put line(" " or "\n") between Other and cjk-letter
- put softline("" or "\n") between cjk-letter and cjk-letter
- put nothing between Other and cjk-punctuation, i.e. they're considered not breakable
Requirement 1 does follow the requirements of Chinese Text Layout, Japanese Text Layout, and Korean Text Layout.
Although these requirements all specify complicated line breaking rules, we can and should implement only a tiny subset of them. On this principle, Requirement 2 is acceptable for both Japanese and Chinese. However, as noted on p. 518 of CJKV Information Processing, Korean text is composed of Hangul and uses conventional spaces, which makes it more like western typography than Chinese/Japanese. So we had better do nothing between Hangul Syllables/Jamos. I guess this is the reason why Hangul Syllables is categorized as cjk_punctuations.
Requirement 3 is acceptable as-is.
Solution
We should split cjk-letter into two classes:
cj-letter: cjk-letter with Script=Han, Katakana, Hiragana
Hangul: cjk-letter with Script=Hangul
And revise the requirements to match the w3c typography requirements:
- put line(" " or "\n") between Other and cjk-letter
- put softline("" or "\n") between cjk-letter and cjk-letter, except Hangul and Hangul.
- put nothing between Other and cjk-punctuation, i.e. they're considered not breakable
This part should be done on the prettier side, but it implies that cjk-regex should expose one more interface: Hangul.
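To make the revised requirements concrete, here is a minimal sketch of how a consumer might pick a separator once the classes are exposed. Everything named here (`classify`, `separatorBetween`, the return strings) is hypothetical and is not prettier's or this package's actual API. The cjk-punctuation pattern is only a placeholder, since the real list is still to be discussed. The sketch also assumes that any pair involving cjk-punctuation is unbreakable and that Other/Other pairs are left alone, neither of which is spelled out above.

```javascript
// cj-letter: Letter with Script=Han, Katakana, Hiragana.
const CJ_LETTER =
  /(?=[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}])[\p{Lo}\p{Nl}\p{So}]/u;
// Hangul: Letter with Script=Hangul.
const HANGUL = /(?=\p{Script=Hangul})[\p{Lo}\p{Nl}\p{So}]/u;
// Placeholder only: punctuation inside two blocks, pending the real list.
const CJK_PUNCTUATION = /(?=\p{P})[\u3000-\u303F\uFF00-\uFFEF]/u;

function classify(char) {
  if (CJK_PUNCTUATION.test(char)) return "cjk-punctuation";
  if (HANGUL.test(char)) return "hangul";
  if (CJ_LETTER.test(char)) return "cj-letter";
  return "other";
}

// Returns which separator the revised requirements ask for between two characters.
function separatorBetween(left, right) {
  const a = classify(left);
  const b = classify(right);
  // Assumption: any pair involving cjk-punctuation is not breakable.
  if (a === "cjk-punctuation" || b === "cjk-punctuation") return "nothing";
  // Hangul next to Hangul: do nothing, Korean already uses spaces.
  if (a === "hangul" && b === "hangul") return "unchanged";
  // Assumption: two Other characters are outside the scope of these rules.
  if (a === "other" && b === "other") return "unchanged";
  // Other next to any cjk-letter (cj-letter or Hangul).
  if (a === "other" || b === "other") return "line";
  // cjk-letter next to cjk-letter, except the Hangul/Hangul case handled above.
  return "softline";
}

console.log(separatorBetween("a", "中")); // "line"
console.log(separatorBetween("中", "文")); // "softline"
console.log(separatorBetween("한", "글")); // "unchanged"
console.log(separatorBetween("a", "。")); // "nothing"
```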
Technical Notes
- The current regex implementation, written without the unicode flag, will become unmaintainable once we support the SIP (Supplementary Ideographic Plane) characters. We should use the unicode flag and use regexpu-core to transpile to ES5.
- We don't have to maintain the blocks ourselves; we can simply use unicode-data to generate our code points, filter the necessary ones, and convert them back to a unicode regex.
- The extra computation this introduces is wasteful, because once we pick a unicode-data version the regex is generated deterministically. We can use prepack to generate an evaluated build.
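The "convert back to a unicode regex" step could look roughly like the sketch below. It deliberately does not call unicode-data, regexpu-core, or prepack; it takes a plain array of code points as input, and the function name `codePointsToRegex` is hypothetical. It only illustrates collapsing code points into ranges and emitting a character class that relies on the unicode flag.

```javascript
// Build a unicode-flag regex from a list of code points by merging
// consecutive code points into ranges.
function codePointsToRegex(codePoints) {
  const sorted = [...new Set(codePoints)].sort((a, b) => a - b);
  const ranges = [];
  for (const cp of sorted) {
    const last = ranges[ranges.length - 1];
    if (last && cp === last[1] + 1) {
      last[1] = cp; // extend the current range
    } else {
      ranges.push([cp, cp]); // start a new range
    }
  }
  // \u{...} escapes are only valid with the unicode flag.
  const escape = (cp) => `\\u{${cp.toString(16).toUpperCase()}}`;
  const body = ranges
    .map(([lo, hi]) => (lo === hi ? escape(lo) : `${escape(lo)}-${escape(hi)}`))
    .join("");
  return new RegExp(`[${body}]`, "u");
}

const regex = codePointsToRegex([0x4e00, 0x4e01, 0x4e02, 0x20000]);
console.log(regex.source); // [\u{4E00}-\u{4E02}\u{20000}]
console.log(regex.test("\u{20000}")); // true

// Without the unicode flag a SIP character is two code units, not one character:
console.log(/^.$/.test("\u{20000}")); // false
console.log(/^.$/u.test("\u{20000}")); // true
```

Because the output depends only on the input list, the result for a fixed data version can be computed once at build time rather than on every load.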
Thanks for your patience reading this long issue. 😄
Looks great! I'll probably rewrite unicode-regex using unicode-data and regexpu-core this weekend, and also prepack the cjk part in this package.