Support 31bit codepoint with UTF-8
omochi opened this issue · 2 comments
I am using Onigmo with UTF-8.
If I pass the regex `\x{7fffffff}`, Onigmo fails to compile it and returns `ONIGERR_TOO_BIG_WIDE_CHAR_VALUE`.
That is reasonable, since UTF-8 supports codepoints of at most 21 bits with 4 bytes.
But regexes like this exist in the wild.
This syntax file is actually a problem in my project right now.
I am using Swift.
Swift is optimized for UTF-8 in C interop.
So I don't want to use UTF-16.
If we extend the UTF-8 codec to 6 bytes,
it can handle codepoints of up to 31 bits.
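To make the byte layout concrete, here is a minimal sketch of the 6-byte scheme (the one the original UTF-8 definition in RFC 2279 allowed before RFC 3629 restricted it to 4 bytes). The function names `utf8_31bit_length` and `utf8_31bit_encode` are made up for illustration and are not part of Onigmo. The sketch does no surrogate or overlong checks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed for a code point in the
 * extended scheme; returns 0 if the value does not fit in 31 bits. */
size_t utf8_31bit_length(uint32_t cp)
{
    if (cp < 0x80u)        return 1;  /*  7 bits */
    if (cp < 0x800u)       return 2;  /* 11 bits */
    if (cp < 0x10000u)     return 3;  /* 16 bits */
    if (cp < 0x200000u)    return 4;  /* 21 bits */
    if (cp < 0x4000000u)   return 5;  /* 26 bits */
    if (cp <= 0x7FFFFFFFu) return 6;  /* 31 bits */
    return 0;
}

/* Hypothetical helper: writes the encoded bytes into out and returns
 * the length, or 0 if cp is out of range. */
size_t utf8_31bit_encode(uint32_t cp, unsigned char out[6])
{
    /* Lead-byte prefixes, indexed by sequence length. */
    static const unsigned char lead[7] =
        { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t len = utf8_31bit_length(cp);
    size_t i;

    if (len == 0) return 0;
    if (len == 1) { out[0] = (unsigned char)cp; return 1; }

    /* Fill continuation bytes from the end, 6 payload bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80u | (cp & 0x3Fu));
        cp >>= 6;
    }
    /* The remaining high bits go into the lead byte. */
    out[0] = (unsigned char)(lead[len] | cp);
    return len;
}
```

With this layout, `0x7fffffff` encodes as the six bytes `FD BF BF BF BF BF`, while `0x10FFFF` still encodes as the usual four bytes `F4 8F BF BF`.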
So I want this improvement.
But this extension is not true UTF-8.
There is a design problem here that needs a decision.
What do you think?
If I implement this extension,
will you merge it?
Oniguruma originally supported 31-bit codepoints, but Ruby limited it to 21-bit codepoints. So this is intentional. If you want to support 31-bit codepoints, some kind of configuration is needed.
Thanks for the response and the information.
I will try to implement it as an option configurable by `#define`.
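As a rough sketch of what a compile-time switch could look like: the macro `EXAMPLE_UTF8_31BITS` and the other `EXAMPLE_`/`example_` names below are hypothetical and are not actual Onigmo identifiers. The 21-bit default limit follows the figure mentioned in this thread and is an assumption about where the cutoff would sit.

```c
#include <stdint.h>

/* Hypothetical switch (not an Onigmo macro): build with
 * -DEXAMPLE_UTF8_31BITS to allow the extended 6-byte range. */
#ifdef EXAMPLE_UTF8_31BITS
#  define EXAMPLE_UTF8_MAX_CODE   0x7FFFFFFFu  /* 31 bits */
#  define EXAMPLE_UTF8_MAX_BYTES  6
#else
#  define EXAMPLE_UTF8_MAX_CODE   0x1FFFFFu    /* 21 bits */
#  define EXAMPLE_UTF8_MAX_BYTES  4
#endif

/* Returns nonzero if the code point is accepted under the current
 * build configuration. */
int example_code_in_range(uint32_t code)
{
    return code <= EXAMPLE_UTF8_MAX_CODE;
}
```

In the default build, `\x{7fffffff}` would still be rejected. Only a build that defines the macro would accept it, so the existing 21-bit behavior stays unchanged unless someone opts in.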