contour-terminal/libunicode

Hangul Jamo vowels and trailing consonants should probably be 0 width


U+1160..U+11FF and U+D7B0..U+D7FF should have 0 width.

Korean Hangul is a writing system that uses syllable blocks built from alphabetic components (jamo). A syllable consists of one or more leading consonants, one or more vowels, and zero or more trailing consonants.

Unicode has precomposed syllable blocks at U+AC00..U+D7A3 (11172 code points).
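
For reference, the count 11172 is 19 leading consonants × 21 vowels × 28 trailing options (27 consonants plus "none"). A minimal Python sketch of the standard Hangul composition arithmetic follows; the constant and function names are illustrative and are not taken from libunicode:

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"


def compose_syllable(lead: str, vowel: str, trail: str = "") -> str:
    """Compose modern conjoining jamo into one precomposed syllable (illustrative helper)."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT and 0 <= t_index < T_COUNT):
        raise ValueError("not a modern L, V, T jamo combination")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


# Size of the precomposed range U+AC00..U+D7A3.
syllable_count = L_COUNT * V_COUNT * T_COUNT

# HIEUH (U+1112) + A (U+1161) + NIEUN (U+11AB)
han = compose_syllable("\u1112", "\u1161", "\u11AB")

# NFC normalization performs the same composition.
han_nfc = unicodedata.normalize("NFC", "\u1112\u1161\u11AB")
```

Only the modern jamo subset composes this way; archaic jamo sequences have no precomposed form and stay as conjoining sequences, which is why their individual widths matter.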

There are also component Jamos:

  • Hangul Jamo (U+1100..U+11FF).
    • U+1100..U+115F Choseong (initial, Leading Consonants) have East_Asian_Width=Wide and Hangul_Syllable_Type=Leading_Jamo
    • U+1160..U+11A7 Jungseong (medial, Vowels) have East_Asian_Width=Neutral and Hangul_Syllable_Type=Vowel_Jamo
    • U+11A8..U+11FF Jongseong (final, Trailing consonants) have East_Asian_Width=Neutral and Hangul_Syllable_Type=Trailing_Jamo
  • U+A960..U+A97F Hangul Jamo Extended-A (choseong) have East_Asian_Width=Wide
  • U+D7B0..U+D7FF Hangul Jamo Extended-B (jungseong and jongseong) have East_Asian_Width=Neutral
  • U+3130..U+318F Hangul Compatibility Jamo have no conjoining behavior
  • U+FFA0..U+FFDF half-width forms have no conjoining behavior.
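
The list above can be sketched as a small lookup table. This is only an illustration: the table name, the role labels, and `jamo_role` are hypothetical, and the ranges are the whole blocks as listed here, so they include a few unassigned code points.

```python
from typing import Optional

# (first, last, role) for the conjoining ranges listed above.
# "L" = choseong, "V" = jungseong, "T" = jongseong.
CONJOINING_JAMO_RANGES = [
    (0x1100, 0x115F, "L"),
    (0x1160, 0x11A7, "V"),
    (0x11A8, 0x11FF, "T"),
    (0xA960, 0xA97F, "L"),   # Hangul Jamo Extended-A
    (0xD7B0, 0xD7FF, "VT"),  # Hangul Jamo Extended-B mixes jungseong and jongseong
]


def jamo_role(cp: int) -> Optional[str]:
    """Return the conjoining role of a code point, or None if it does not conjoin."""
    for first, last, role in CONJOINING_JAMO_RANGES:
        if first <= cp <= last:
            return role
    return None
```

Compatibility Jamo (U+3130..U+318F) and the half-width forms (U+FFA0..U+FFDF) intentionally return `None`, since they have no conjoining behavior.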

U+1100..U+11FF, U+A960..U+A97F, and U+D7B0..U+D7FF have conjoining behavior: a sequence of L+V+T* is rendered as a single syllable block. wcwidth() implementations tend to give U+1100..U+115F width 2 and U+1160..U+11FF width 0, so the resulting syllable block has the correct total width of 2.

U+D7B0..U+D7FF should also have width 0.
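
A rough sketch of the requested behavior, in Python for brevity. It is not libunicode's API: `jamo_aware_width` and `text_width` are hypothetical helpers that cover only the Hangul ranges discussed here and fall back to width 1 for everything else.

```python
def jamo_aware_width(cp: int) -> int:
    """Column width for Hangul code points; a placeholder of 1 for anything else."""
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97F:
        return 2  # leading consonants carry the width of the whole block
    if 0x1160 <= cp <= 0x11FF or 0xD7B0 <= cp <= 0xD7FF:
        return 0  # vowels and trailing consonants attach to the leading jamo
    if 0xAC00 <= cp <= 0xD7A3:
        return 2  # precomposed syllables
    return 1  # assumption: a real table would be consulted here


def text_width(text: str) -> int:
    """Sum the per-code-point widths, as a wcwidth()-style consumer would."""
    return sum(jamo_aware_width(ord(ch)) for ch in text)


# A conjoining sequence and its precomposed form occupy the same columns.
conjoining = "\u1100\u1161\u11A8"  # KIYEOK + A + KIYEOK
precomposed = "\uAC01"
same_width = text_width(conjoining) == text_width(precomposed)
```

With width 1 for U+D7B0..U+D7FF, a sequence such as U+A960 U+D7B0 U+D7CB would be counted as 4 columns instead of 2.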

glibc gave width 0 to conjoining jungseong and jongseong in the following commits:

commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog

            [BZ #21750]
            * charmaps/UTF-8: Refresh.

diff --git a/localedata/ChangeLog b/localedata/ChangeLog
index 04ef5ad071..9e05b4a652 100644
--- a/localedata/ChangeLog
+++ b/localedata/ChangeLog
@@ -1,3 +1,17 @@
+2017-07-14  Thorsten Glaser  <tg@mirbsd.de>
+
+       [BZ #21750]
+       * charmaps/UTF-8: Refresh.
+       * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
+       * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
+       * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
+       * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
+       * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.
+       [BZ #19852]
+       * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
+       UnicodeData lines so the latter have precedence; remove hack
+       to group output by EastAsianWidth ranges.
+

[ ... snip ...]

commit 6e540caa21616d5ec5511fafb22819204525138e
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Tue Jun 16 08:29:40 2020 +0200

    Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]
    Reviewed-by: Carlos O'Donell <carlos@redhat.com>

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0

Hey @ninjalj, sorry for the late reaction. I want to take care of this soon, but my time has been limited recently. If no one else gets to it first, I'll do it as soon as I can. Many thanks for the very detailed information here. :)

Interesting: utf8proc has a printproperty binary (built when the -DUTF8PROC_ENABLE_TESTING=ON option is set).
Output for some code points:
$ printproperty 1110

U+1110: ᄐ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = 1110 (seqindex ffff)
lowercase_mapping = 1110 (seqindex ffff)
titlecase_mapping = 1110 (seqindex ffff)
casefold = ᄐ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 0
control_boundary = 0
boundclass = 6
charwidth = 2

$ printproperty 1160

U+1160: ᅠ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = 1160 (seqindex ffff)
lowercase_mapping = 1160 (seqindex ffff)
titlecase_mapping = 1160 (seqindex ffff)
casefold = ᅠ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 1
control_boundary = 0
boundclass = 7
charwidth = 1

$ printproperty 11A8

U+11A8: ᆨ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = 11a8 (seqindex ffff)
lowercase_mapping = 11a8 (seqindex ffff)
titlecase_mapping = 11a8 (seqindex ffff)
casefold = ᆨ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 0
control_boundary = 0
boundclass = 8
charwidth = 1

$ printproperty A960

U+A960: ꥠ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = a960 (seqindex ffff)
lowercase_mapping = a960 (seqindex ffff)
titlecase_mapping = a960 (seqindex ffff)
casefold = ꥠ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 0
control_boundary = 0
boundclass = 6
charwidth = 2

$ printproperty D7B0

U+D7B0: ힰ
category = Lo
combining_class = 0
bidi_class = 1
decomp_type = 0
uppercase_mapping = d7b0 (seqindex ffff)
lowercase_mapping = d7b0 (seqindex ffff)
titlecase_mapping = d7b0 (seqindex ffff)
casefold = ힰ
comb_index = 65535
bidi_mirrored = 0
comp_exclusion = 0
ignorable = 0
control_boundary = 0
boundclass = 7
charwidth = 1

utf8proc reports charwidth = 1 for U+1160, U+11A8, and U+D7B0, so I will have to open another issue at utf8proc.

Some further discussion:

https://lists.gnu.org/archive/html/bug-libunistring/2021-12/msg00006.html (and replies)
ridiculousfish/widecharwidth#16