mrabarnett/mrab-regex

Regex fails Unicode 15.1 GraphemeBreakTest due to missing new GB9c rule implementation

The "\X" (extended grapheme cluster) matcher fails Unicode 15.1.0's GraphemeBreakTest because it no longer conforms to the Unicode specification at https://www.unicode.org/reports/tr29/tr29-43.html (revision 43).

Unicode 15.1 introduced a new rule, GB9c: "Do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker." The regex module was updated for Unicode 15.1 in version 2023.10.3, but that update did not implement this rule.

Consequently, the test fails for the following lines in the GraphemeBreakTest file:

Line 1202: ÷ 0915 × 094D × 0924 ÷
Line 1203: ÷ 0915 × 094D × 094D × 0924 ÷
Line 1204: ÷ 0915 × 094D × 200D × 0924 ÷
Line 1205: ÷ 0915 × 093C × 200D × 094D × 0924 ÷
Line 1206: ÷ 0915 × 093C × 094D × 200D × 0924 ÷
Line 1207: ÷ 0915 × 094D × 0924 × 094D × 092F ÷
Line 1211: ÷ 0915 × 094D × 094D × 0924 ÷

Please note that line 1211 is identical to line 1203.
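
To make the failing lines easier to read, here is a minimal sketch that turns one test line into the clusters it expects. The function name `parse_break_test_line` is hypothetical and not part of the regex module. It assumes the "Line NNNN:" prefix has been removed, and that ÷ marks a break and × marks no break, as in the lines above.

```python
def parse_break_test_line(line):
    """Return the expected clusters for one GraphemeBreakTest line."""
    # Drop any trailing "# ..." comment, then split on whitespace.
    tokens = line.split("#", 1)[0].split()
    clusters = []
    current = ""
    for token in tokens:
        if token == "÷":
            # Break opportunity: close the cluster collected so far.
            if current:
                clusters.append(current)
                current = ""
        elif token == "×":
            # No break: keep extending the current cluster.
            continue
        else:
            # Anything else is a hexadecimal code point.
            current += chr(int(token, 16))
    if current:
        clusters.append(current)
    return clusters


# Line 1202 expects a single cluster of three code points.
expected = parse_break_test_line("÷ 0915 × 094D × 0924 ÷")
assert expected == ["\u0915\u094D\u0924"]
```

Each of the lines listed above contains only × between its outer ÷ marks, so each one expects a single cluster.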

For example, हिन्दी splits into three extended grapheme clusters under Unicode 15.0, ["\u0939\u093F", "\u0928\u094D", "\u0926\u0940"], but into only two under Unicode 15.1: ["\u0939\u093F", "\u0928\u094D\u0926\u0940"].
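
The difference can be reproduced with a deliberately simplified sketch. This is not the regex module's implementation. It covers only the "attach Extend/ZWJ", "attach SpacingMark" and GB9c rules, and the property sets are hand-picked for the code points in this issue rather than read from the Unicode data files. The names `clusters` and `gb9c_joins` are hypothetical.

```python
# Hand-picked properties, covering only the code points used in this issue.
CONSONANT = set("\u0915\u0924\u0926\u0928\u092F\u0939")  # InCB=Consonant
LINKER = {"\u094D"}                                      # InCB=Linker (virama)
INCB_EXTEND = {"\u093C", "\u200D"}                       # InCB=Extend (nukta, ZWJ)
SPACING_MARK = {"\u093F", "\u0940"}                      # vowel signs i and ii


def gb9c_joins(text, i):
    """Return True if GB9c forbids a break immediately before text[i]."""
    if text[i] not in CONSONANT:
        return False
    # Walk back over Extend/Linker characters, noting whether a Linker occurs.
    j = i - 1
    seen_linker = False
    while j >= 0 and (text[j] in LINKER or text[j] in INCB_EXTEND):
        seen_linker = seen_linker or text[j] in LINKER
        j -= 1
    # The run must contain a Linker and be preceded by a Consonant.
    return seen_linker and j >= 0 and text[j] in CONSONANT


def clusters(text, use_gb9c):
    """Segment text using the simplified rule set described above."""
    out = []
    for i, ch in enumerate(text):
        attach = bool(out) and (
            ch in LINKER
            or ch in INCB_EXTEND
            or ch in SPACING_MARK
            or (use_gb9c and gb9c_joins(text, i))
        )
        if attach:
            out[-1] += ch
        else:
            out.append(ch)
    return out


hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"
assert len(clusters(hindi, use_gb9c=False)) == 3  # 15.0 behaviour
assert len(clusters(hindi, use_gb9c=True)) == 2   # 15.1 behaviour
```

With the regex module installed, the result of `regex.findall(r"\X", hindi)` can be compared against either list to see which behaviour a given version implements.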

Fixed in regex 2024.6.22.