k-takata/Onigmo

Support 31-bit code points with UTF-8

omochi opened this issue · 2 comments

I am using Onigmo with UTF-8.
If I pass the regex \x{7fffffff}, Onigmo fails to compile it and returns ONIGERR_TOO_BIG_WIDE_CHAR_VALUE.

This is reasonable, since UTF-8 supports code points of at most 21 bits, encoded in up to 4 bytes.

But regexes like this exist in the wild.

https://github.com/Microsoft/vscode/blob/772aaf777a2e6b50c5c2e53da1a0955d2cb73a4d/extensions/php/syntaxes/php.tmLanguage.json#L26

This syntax file is currently causing a problem in my project.

https://github.com/omochi/TMSyntax/blob/857a6fbab4d998946351c98c19d59390cabe7cca/Tests/TMSyntaxTests/TestCase/ParserTests.swift#L213

I am using Swift.
Swift is optimized for UTF-8 in C interop.
So I don't want to use UTF-16.

If we extend the UTF-8 codec to 6 bytes,
it can handle code points of up to 31 bits.
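
As a rough illustration of that idea, here is a minimal sketch of an encoder for the older 6-byte form of UTF-8 (the scheme described in RFC 2279). The function name `utf8_encode_31bit` is hypothetical and is not part of Onigmo; the sketch also does not reject surrogates or overlong forms. The payload bits are 3 + 3×6 = 21 for a 4-byte sequence and 1 + 5×6 = 31 for a 6-byte sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an Onigmo API.
 * Encodes `cp` with the original (RFC 2279 style) UTF-8 scheme, which allows
 * sequences of up to 6 bytes and therefore code points up to 0x7FFFFFFF.
 * Returns the number of bytes written to `buf` (1..6), or 0 if `cp` does not
 * fit in 31 bits. `buf` must have room for at least 6 bytes. */
size_t utf8_encode_31bit(uint32_t cp, unsigned char buf[6])
{
    size_t len;
    unsigned char lead;

    if (cp < 0x80) {                 /* 7 bits: plain ASCII, single byte */
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {         /* 11 bits */
        len = 2; lead = 0xC0;
    } else if (cp < 0x10000) {       /* 16 bits */
        len = 3; lead = 0xE0;
    } else if (cp < 0x200000) {      /* 21 bits: longest form in modern UTF-8 */
        len = 4; lead = 0xF0;
    } else if (cp < 0x4000000) {     /* 26 bits: 5-byte extension */
        len = 5; lead = 0xF8;
    } else if (cp <= 0x7FFFFFFF) {   /* 31 bits: 6-byte extension */
        len = 6; lead = 0xFC;
    } else {
        return 0;                    /* does not fit in 31 bits */
    }

    /* Fill the continuation bytes (10xxxxxx) from last to first. */
    for (size_t i = len - 1; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* The remaining high bits go into the lead byte. */
    buf[0] = (unsigned char)(lead | cp);
    return len;
}
```

With this scheme, \x{7fffffff} would be encoded as the six bytes FD BF BF BF BF BF.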

So I would like this improvement.
But this extension is not true UTF-8.
There is a design problem here that needs a decision.
What do you think?

If I implement this extension,
would you merge it?

Oniguruma originally supported 31-bit codepoints, however Ruby limited it to 21-bit codepoints. So this is intentional. If you want to support 31-bit codepoints, some kind of configuration is needed.

Thanks for the response and the information.
I will try to make it configurable via #define.
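
A minimal sketch of what such a compile-time switch might look like. The macro and function names (`EXAMPLE_UTF8_31BITS`, `EXAMPLE_MAX_CODE_POINT`, `example_code_point_in_range`) are made up for illustration and are not Onigmo's actual identifiers. The 21-bit default uses 0x1FFFFF, the largest value a 4-byte sequence can carry, which is an assumption about where the limit would sit.

```c
#include <stdint.h>

/* Hypothetical build-time switch (not an actual Onigmo option):
 * compile with -DEXAMPLE_UTF8_31BITS to accept 5- and 6-byte sequences. */
#ifdef EXAMPLE_UTF8_31BITS
# define EXAMPLE_MAX_CODE_POINT UINT32_C(0x7FFFFFFF) /* 31 bits */
# define EXAMPLE_MAX_UTF8_LEN   6
#else
# define EXAMPLE_MAX_CODE_POINT UINT32_C(0x1FFFFF)   /* 21 bits (assumed limit) */
# define EXAMPLE_MAX_UTF8_LEN   4
#endif

/* Returns 1 if `cp` is encodable under the configured limit, 0 otherwise.
 * A regex compiler could report an error such as
 * ONIGERR_TOO_BIG_WIDE_CHAR_VALUE when this returns 0. */
int example_code_point_in_range(uint32_t cp)
{
    return cp <= EXAMPLE_MAX_CODE_POINT;
}
```

By default, \x{7fffffff} is still rejected, so the existing 21-bit behavior is kept unless the option is enabled at build time.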