ikatyang/cjk-regex

discussion: rewrite the package to meet w3c typography requirements and complete character coverage

JLHwung opened this issue · 2 comments

This is a follow-up to #47.

Properties Coverage instead of Blocks Coverage

I don't think it is sustainable to maintain a long list of Unicode blocks (which, unfortunately, is what most isCJK JS utilities do, as far as I know). We should take a step forward and use the properties defined in the UCD and maintained by Unicode experts. For example, we can choose all encoded characters that satisfy both of the following constraints:

Script=Han
General_Category=Other_Letter|Letter_Number|Other_Symbol

The semantics of General_Category are described here. By doing so we can abstract away from the concrete Unicode blocks and work with character properties instead.
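As a rough sketch of what property-based selection could look like, the two constraints above can be written with Unicode property escapes in a `u`-flag regex. This assumes an engine that supports `\p{...}` (recent Node.js and browsers do); the name `isHanLetter` is hypothetical and is not part of cjk-regex.

```javascript
// Sketch only: select characters by property instead of by block.
// The lookahead intersects Script=Han with the three general categories,
// because a plain `u`-flag character class has no intersection operator.
const hanLetter =
  /^(?=\p{Script=Han})[\p{General_Category=Other_Letter}\p{General_Category=Letter_Number}\p{General_Category=Other_Symbol}]$/u;

// Hypothetical helper: expects a string holding exactly one code point.
function isHanLetter(ch) {
  return hanLetter.test(ch);
}
```

Note that characters outside the BMP (for example in CJK Unified Ideographs Extension B) are covered without listing any block.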

Terminology in accordance with Unicode

Character: An association between an abstract character and a code point (D11, Encoded Character, defined here)
Punctuation: Any character with General_Category = Punctuation

Our definitions for our specific purpose

cjk-punctuation: (Some list of blocks, to be discussed. I mostly agree with the current blocks, except for Hangul Syllables.)
Letter: Any character with General_Category = Other_Letter | Letter_Number | Other_Symbol
cjk-letter: Any Letter with Script = Han, Katakana, Hiragana, or Hangul.
Other: Any character that is neither cjk-punctuation nor cjk-letter.
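To make the three classes concrete, here is a minimal sketch of a classifier. The cjk-punctuation block list is still open in this issue, so the two blocks used below (CJK Symbols and Punctuation, Halfwidth and Fullwidth Forms) are placeholders chosen purely for illustration, and `classify` is a hypothetical name.

```javascript
// ASSUMPTION: placeholder block list for cjk-punctuation; the real list is
// "to be discussed". Only punctuation characters inside these blocks count.
const cjkPunctuation =
  /^(?=\p{General_Category=Punctuation})[\u3000-\u303F\uFF00-\uFFEF]$/u;

// cjk-letter: a Letter (Lo | Nl | So) whose script is one of the four below.
const cjkLetter =
  /^(?=[\p{Script=Han}\p{Script=Katakana}\p{Script=Hiragana}\p{Script=Hangul}])[\p{Lo}\p{Nl}\p{So}]$/u;

// Hypothetical helper: expects a string holding exactly one code point.
function classify(ch) {
  if (cjkPunctuation.test(ch)) return "cjk-punctuation";
  if (cjkLetter.test(ch)) return "cjk-letter";
  return "other";
}
```

With these definitions, anything that is not matched by either regex (Latin letters, ASCII punctuation, digits, spaces) falls into Other.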

Compliance with W3C typography requirements

As far as I can tell from npm, prettier is the only dependent of this package, so we can rethink the package's use case:

According to prettier/prettier#3026, the requirements of printer-markdown can be rephrased by the new terminology as:

  1. put line(" " or "\n") between Other and cjk-letter
  2. put softline("" or "\n") between cjk-letter and cjk-letter
  3. put nothing between Other and cjk-punctuation, i.e. they're considered not breakable

Requirement 1 follows the Requirements for Chinese Text Layout, Japanese Text Layout, and Korean Text Layout.

Although these requirements all specify complicated line breaking rules, we can and should implement only a small subset of them. On this principle, Requirement 2 is acceptable for both Japanese and Chinese. However, as noted on p. 518 of CJKV Information Processing, Korean text is composed of Hangul and uses conventional spaces, which makes it closer to Western typography than to Chinese or Japanese. So we should insert nothing between Hangul Syllables/Jamo. I guess this is why Hangul Syllables is categorized as cjk_punctuations.

Requirement 3 is acceptable as-is.

Solution

We should split cjk-letter into two classes:
cj-letter: cjk-letter with Script=Han, Katakana, Hiragana
Hangul: cjk-letter with Script=Hangul

And revise the requirements to match the W3C typography requirements:

  1. put line(" " or "\n") between Other and cjk-letter
  2. put softline("" or "\n") between cjk-letter and cjk-letter, except between Hangul and Hangul
  3. put nothing between Other and cjk-punctuation, i.e. they're considered not breakable

This part should be done on the prettier side, but it implies that cjk-regex should expose one more interface: Hangul
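The revised requirements can be sketched as a small decision function. This is not prettier's actual printer code: `kindOf` and `separatorBetween` are hypothetical names, the cjk-punctuation blocks are a placeholder (the real list is still under discussion), and pairs the three rules do not mention return `undefined` rather than a guessed value.

```javascript
const letter = "[\\p{Lo}\\p{Nl}\\p{So}]";
const cjLetter = new RegExp(
  "^(?=[\\p{Script=Han}\\p{Script=Katakana}\\p{Script=Hiragana}])" + letter + "$", "u");
const hangul = new RegExp("^(?=\\p{Script=Hangul})" + letter + "$", "u");
// ASSUMPTION: placeholder block list for cjk-punctuation.
const cjkPunctuation = /^(?=\p{P})[\u3000-\u303F\uFF00-\uFFEF]$/u;

// Hypothetical helper: classify one code point into the four kinds.
function kindOf(ch) {
  if (cjkPunctuation.test(ch)) return "cjk-punctuation";
  if (hangul.test(ch)) return "hangul";
  if (cjLetter.test(ch)) return "cj-letter";
  return "other";
}

const isCjkLetter = (kind) => kind === "cj-letter" || kind === "hangul";

// Returns "line", "softline", "none", or undefined when no rule applies.
function separatorBetween(left, right) {
  const a = kindOf(left);
  const b = kindOf(right);
  // Rule 3: Other next to cjk-punctuation is not breakable.
  if ((a === "other" && b === "cjk-punctuation") ||
      (a === "cjk-punctuation" && b === "other")) return "none";
  // Rule 1: Other next to cjk-letter gets a line.
  if ((a === "other" && isCjkLetter(b)) ||
      (isCjkLetter(a) && b === "other")) return "line";
  // Rule 2: cjk-letter pairs get a softline, except Hangul next to Hangul.
  if (isCjkLetter(a) && isCjkLetter(b)) {
    return a === "hangul" && b === "hangul" ? "none" : "softline";
  }
  return undefined; // not covered by the three rules above
}
```

The Hangul exception is the only place where the split into cj-letter and Hangul matters, which is why exposing a separate Hangul pattern would be enough for the caller.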

Technical Notes

  1. The current implementation, which builds the regex without the unicode flag, will become unmaintainable once we support SIP (Supplementary Ideographic Plane) characters. We should use the unicode flag and transpile to ES5 with regexpu-core.

  2. We don't have to maintain the blocks ourselves. We can simply use unicode-data to generate our code points, filter out the ones we need, and convert the result back to a unicode regex.

  3. The extra computation this introduces is wasteful at runtime, because once we pick a unicode-data version the regex is generated deterministically. We can use prepack to generate a pre-evaluated build.
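A minimal sketch of the "code points in, unicode regex out" step from note 2. To stay self-contained it asks the engine's own property escapes which code points qualify, as a stand-in for a Unicode data package; the ES5 transpile (note 1) and the pre-evaluated build (note 3) are not shown. The helper names `collectRanges` and `toPattern` are hypothetical.

```javascript
// Stand-in data source: which code points are Hangul letters?
const wanted = /^(?=\p{Script=Hangul})[\p{Lo}\p{Nl}\p{So}]$/u;

// Walk every code point and collapse consecutive hits into [lo, hi] ranges.
function collectRanges(predicate) {
  const ranges = [];
  let start = -1;
  // One step past U+10FFFF so that a trailing open range gets closed.
  for (let cp = 0; cp <= 0x110000; cp++) {
    const hit = cp <= 0x10ffff && predicate(cp);
    if (hit && start < 0) start = cp;
    if (!hit && start >= 0) {
      ranges.push([start, cp - 1]);
      start = -1;
    }
  }
  return ranges;
}

// Format one code point as a `u`-flag escape, e.g. \u{AC00}.
const esc = (cp) => "\\u{" + cp.toString(16).toUpperCase() + "}";

// Turn ranges into the source of a character class.
function toPattern(ranges) {
  const parts = ranges.map(([lo, hi]) =>
    lo === hi ? esc(lo) : esc(lo) + "-" + esc(hi));
  return "[" + parts.join("") + "]";
}

const ranges = collectRanges((cp) => wanted.test(String.fromCodePoint(cp)));
const generated = new RegExp("^" + toPattern(ranges) + "$", "u");
```

Because the output depends only on the Unicode data version, the `ranges` computation could run once at build time and the resulting pattern string be shipped as a constant.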

Thanks for your patience reading this long issue. 😄

Looks great! I'll probably rewrite unicode-regex using unicode-data and regexpu-core this weekend, and also prepack the cjk part in this package.

#47 is still breaking after rebasing on 1.0.1. The rare blocks are still missing from the source.