Name2CType data wrong for many Indic scripts?
deepestblue opened this issue · 2 comments
I found this when trying to use Ruby Regexp on Tamil Unicode codepoint data.
irb(main):002:0> "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9".scan(/[[:alpha:]]+/).each { |s| puts s.dump }
"\u0BAE\u0BC0\u0BA9"
"\u0BA9"
=> ["மீன", "ன"]
irb(main):003:0>
Notice that both \u0BC0 and \u0BCD are combining vowel markers in the Mark, Nonspacing [Mn] character category, which should match the [:alpha:] class. But \u0BCD does not seem to match the class.
does not seem to match the class. Stackoverflow told me Ruby uses Onigmo under the hood, and I found the following except in name2ctype.h
in CR_Alpha
, CR_Alnum
, etc.
0x0bca, 0x0bcc,
0x0c01, 0x0c03,
Notice the missing 0x0bcd.
P.S. I found a number of other missing Indic codepoints as well in that file. If you agree this is a bug I can look in the file some more and do an audit. Thanks!
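To make the report easier to reproduce, here is a minimal sketch that checks each code point of the sample word against [[:alpha:]] and against the Letter and Mark general categories. The names SAMPLE_WORD, CHAR_CLASSES and classify are made up for this illustration; only the regular expressions themselves come from Ruby.

```ruby
# Sample word from the irb session above.
SAMPLE_WORD = "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9"

# Character classes to compare (the labels are only for display).
CHAR_CLASSES = {
  "[[:alpha:]]" => /[[:alpha:]]/,
  "\\p{L}"      => /\p{L}/,
  "\\p{M}"      => /\p{M}/
}.freeze

# Returns the labels of every class that matches the given character.
def classify(char)
  CHAR_CLASSES.select { |_label, re| char.match?(re) }.keys
end

SAMPLE_WORD.each_char do |ch|
  printf("U+%04X  %s\n", ch.ord, classify(ch).join(", "))
end
```

On the Ruby versions discussed in this thread, U+0BCD should be the only code point in the word that is listed under \p{M} but not under [[:alpha:]].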
See Why do some Unicode combining markers (like \u0BCD) not match [:alpha:] in Ruby? on Stack Overflow for a discussion, partially reproduced below:
The two characters in question are (I have marked some interesting things in bold):
- U+0BC0 Tamil Vowel Sign II, with the following (relevant) properties:
- U+0BCD Tamil Sign Virama, with the following (relevant) properties:
The Ruby documentation for the Regexp class does not explicitly spell out what [[:alpha:]] matches, but it does say that the POSIX bracket expressions match non-ASCII characters, and it gives [[:digit:]] as an example, saying it matches anything with the Unicode property Nd (Decimal Number).
While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.
On the other hand, the documentation for Onigmo does explicitly specify the workings of [[:alpha:]]. In fact, it specifies it in two different places, and they contradict each other:
- In doc/RE, it says that [[:alpha:]] matches Letter | Mark.
- In doc/UnicodeProps.txt, it seems to imply that [[:alpha:]] matches Alphabetic.
So what seems to be going on is that the Unicode Consortium does not consider U+0BCD to be alphabetic, and therefore Onigmo and Ruby do not classify it as [[:alpha:]]. In that case, the Onigmo documentation is incorrect, and the Ruby documentation is imprecise.
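One way to see which of the two documented definitions the engine actually follows is to compare [[:alpha:]] against each candidate over a block of code points. The sketch below does this for the Tamil block (U+0B80..U+0BFF). TAMIL_BLOCK and disagreements are hypothetical names introduced for this example, and it assumes the Ruby build accepts \p{Alphabetic} as a property name.

```ruby
# Code point range of the Tamil block.
TAMIL_BLOCK = (0x0B80..0x0BFF)

# Returns the code points in `range` on which the two regexps disagree.
def disagreements(range, left, right)
  range.select do |cp|
    ch = [cp].pack("U") # code point -> one-character UTF-8 string
    ch.match?(left) != ch.match?(right)
  end
end

alpha = /[[:alpha:]]/
vs_alphabetic  = disagreements(TAMIL_BLOCK, alpha, /\p{Alphabetic}/)
vs_letter_mark = disagreements(TAMIL_BLOCK, alpha, /[\p{L}\p{M}]/)

puts "differs from Alphabetic:    " + vs_alphabetic.map { |cp| format("U+%04X", cp) }.join(" ")
puts "differs from Letter | Mark: " + vs_letter_mark.map { |cp| format("U+%04X", cp) }.join(" ")
```

If the first list is empty and the second contains U+0BCD, that supports the reading that [[:alpha:]] follows the Alphabetic property rather than Letter | Mark. The same helper can be pointed at other Indic blocks by changing the range.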
Thanks, Joerg.
> While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.
Given [[:digit:]] matches Unicode category Nd, for the sake of consistency I'd rather [[:alpha:]] match the union of Unicode category Letter and Unicode category Mark, rather than Unicode property Alphabetic.
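Until the behaviour of [[:alpha:]] is settled, the Letter | Mark semantics can be spelled out explicitly in the pattern. This is only a sketch of a possible workaround, not something Ruby or Onigmo prescribes; LETTER_OR_MARK and letter_mark_runs are names invented for the example.

```ruby
# Explicit union of the Letter and Mark general categories.
LETTER_OR_MARK = /[\p{L}\p{M}]+/

# Splits text into maximal runs of letters and combining marks.
def letter_mark_runs(text)
  text.scan(LETTER_OR_MARK)
end

word = "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9"
p letter_mark_runs(word).length   # one run: the virama no longer splits the word
p word.scan(/[[:alpha:]]+/).length # two runs, as in the original report
```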