unicode-org/unihan-database

CJK Compatibility Ideographs

paulmasson opened this issue · 7 comments

This issue is in reference to the recent commit 95a0718. While I agree that most of these characters add no information to this part of the database, there are a few cases that should be revisited.

These two characters were recently added by me, the first in #124, the second in #133:

U+F9A8 令 kPhonetic 812
U+FA5B 者 kPhonetic 94

Both of these variants appear in Casey, which is why I added them. These should be restored.

This character was added recently by me in #49 as part of issue #48:

U+2F879 峀 kPhonetic 1512*

This variant draws the bottom half of the character in a way that shows the connection of U+5CC0 峀 to the group more clearly. Unless that visualization is not assured across platforms, it too should be restored.

This character has two variants in Casey, one without a dot and one with a dot:

U+F970 殺 kPhonetic 1111

Again, unless that visualization is not assured across platforms, it too should be restored.

Finally, the group from which these two characters were removed has four entries in Casey:

U+F98E 年 kPhonetic 977
U+F995 秊 kPhonetic 977

On my devices these two render precisely the same as the other two characters, so they don't appear to capture the information in Casey. I am ambivalent about restoring these two.

None of these should be restored. I will explain tomorrow when my mind is fresh.

All CJK Compatibility Ideographs normalize to corresponding CJK Unified Ideographs, and the CJK Unified Ideographs to which they normalize are referred to as their canonical equivalents. The following are the canonical equivalents for the six that you cited, each of which already carries the same kPhonetic property values (shown in parentheses):

U+F9A8 令 = U+4EE4 令 (565 812)
U+FA5B 者 = U+8005 者 (94)

The above CJK Compatibility Ideographs have a K- or J-source, and if you look at the code chart glyphs for their canonical equivalents, you will see the same glyphs under two of the sources.

U+2F879 峀 = U+5CC0 峀 (1512*)

The above CJK Compatibility Ideograph will soon be orphaned, probably for Unicode Version 17.0 (2025), because U+5CC0 峀 will likely be disunified per document WG2 N5259 (aka IRG N2676 + ROK feedback):

https://www.unicode.org/wg2/docs/n5259-IRGN2676Disunify5CC0.pdf

The likely code point of the disunified form, which looks like U+2F879 峀, is U+2B73A.

U+F970 殺 = U+6BBA 殺 (46 1111 1281)
U+F98E 年 = U+5E74 年 (977)
U+F995 秊 = U+79CA 秊 (192 977)

The above three CJK Compatibility Ideographs are considered true duplicates of their canonical equivalents, at least when it comes to the K-source of their canonical equivalents.
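The canonical equivalences listed above can be checked mechanically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the `PAIRS` list and the `canonical_equivalent` helper are names made up for illustration, and the code points are the six cited in this thread.

```python
import unicodedata

# (compatibility ideograph, canonical equivalent) pairs cited in this thread.
PAIRS = [
    (0xF9A8, 0x4EE4),
    (0xFA5B, 0x8005),
    (0x2F879, 0x5CC0),
    (0xF970, 0x6BBA),
    (0xF98E, 0x5E74),
    (0xF995, 0x79CA),
]


def canonical_equivalent(code_point: int) -> int:
    """Return the code point a single ideograph becomes under NFC.

    Assumes the input normalizes to exactly one character, which holds
    for CJK Compatibility Ideographs (singleton decompositions).
    """
    normalized = unicodedata.normalize("NFC", chr(code_point))
    return ord(normalized)


for compat, unified in PAIRS:
    result = canonical_equivalent(compat)
    status = "ok" if result == unified else "MISMATCH"
    print(f"U+{compat:04X} -> U+{result:04X} (expected U+{unified:04X}) {status}")
```

Any of the four normalization forms would give the same result here, since each of these characters has a singleton canonical decomposition.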

Keep in mind that how ideographs appear on a particular platform depends on several factors, such as the platform itself (macOS versus Windows), the available fonts, and the language settings of the OS. It is always best to avoid CJK Compatibility Ideographs. WG2 and the UTC stopped accepting them over 10 years ago due to the issues that they cause.

From a technological point of view, I understand why you would want to discourage the use of compatibility ideographs in favor of their canonical equivalents. What bothers me is that Casey has cases, as noted above, where he explicitly includes variants of the root phonetic. Someone comparing Casey to the database will see discrepancies for these cases. How do you make it clear to that person that the data is accurate?

At the very least, the description of kPhonetic in the documentation should state that compatibility ideographs are explicitly excluded from this field.
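Whether or not the documentation says so, the exclusion could be verified against the data itself. Below is a rough sketch, assuming the usual tab-separated Unihan line layout (`U+XXXX`, field name, value); the function names are hypothetical, and the sample lines reuse values quoted in this thread rather than the actual file contents.

```python
import unicodedata


def is_compatibility_ideograph(code_point: int) -> bool:
    """True for code points in the two compatibility blocks that normalize away.

    The normalization check skips the handful of unified ideographs that
    live in U+F900..U+FAFF, as well as unassigned code points.
    """
    in_blocks = (0xF900 <= code_point <= 0xFAFF
                 or 0x2F800 <= code_point <= 0x2FA1F)
    ch = chr(code_point)
    return in_blocks and unicodedata.normalize("NFD", ch) != ch


def compatibility_entries(lines, field="kPhonetic"):
    """Return (code point label, value) for compatibility ideographs with `field`."""
    found = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3 or parts[1] != field:
            continue
        code_point = int(parts[0][2:], 16)  # drop the "U+" prefix
        if is_compatibility_ideograph(code_point):
            found.append((parts[0], parts[2]))
    return found


# Sample lines built from values quoted in this thread (illustrative only).
sample = [
    "U+4EE4\tkPhonetic\t565 812",
    "U+F9A8\tkPhonetic\t812",
    "U+5E74\tkPhonetic\t977",
    "U+F98E\tkPhonetic\t977",
]
print(compatibility_entries(sample))
```

An empty result on the real data would confirm that no compatibility ideographs carry the field.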

I may or may not have time to sufficiently explain this issue before I fly to Japan on Star Wars Day, but the main thing to consider is that relying on the glyphs that the OS displays is not a good way of determining that the property value is appropriate. It is better to use the multicolumn code charts for the 10 CJK Unified Ideographs blocks for this purpose.

For example, consider U+F970 殺 (1111) versus U+6BBA 殺 (46 1111 1281). Both forms—with and without the dot—appear in the multicolumn entry for U+6BBA.

[Image: multicolumn code chart entry for U+6BBA]

@kenlunde I have left this issue open in the event you would like to add a comment to the documentation for the kPhonetic field stating that compatibility characters are explicitly excluded. If you do not feel this is important, then this issue can be closed.

We generally do not explicitly state that CJK Compatibility Ideographs are excluded from Unihan database properties, so I do not see a strong reason for stating this in the Description of this particular property.