qinwf/re2r

Rcpp exception with UTF-8 strings on Windows

Opened this issue · 5 comments

qinwf commented

This Rcpp issue affects the error messages for invalid regular expressions: on Windows, UTF-8 text in the message comes out garbled.

re2("this (is 测试")
#> Error: missing closing ): this (is 娴嬭瘯 
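The garbling above is consistent with the UTF-8 bytes of the pattern being re-read in a double-byte Windows code page. The following is a minimal sketch of that effect, not code from re2r or Rcpp. It assumes the session uses a GBK-family code page such as CP936, which the report does not state, and the helper names `hex_bytes` and `split_double_byte` are hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Format one byte as two lowercase hex digits.
static std::string hex_byte(unsigned char b) {
    char buf[3];
    std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
    return std::string(buf);
}

// Hypothetical helper: hex dump of a byte string, bytes separated by spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i > 0) out += ' ';
        out += hex_byte(static_cast<unsigned char>(s[i]));
    }
    return out;
}

// Hypothetical helper: group bytes the way a double-byte code page would.
// Assumption: any byte >= 0x81 is a lead byte that consumes the next byte;
// this simplifies the real GBK rules and performs no character lookup.
std::vector<std::string> split_double_byte(const std::string& s) {
    std::vector<std::string> units;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (b >= 0x81 && i + 1 < s.size()) {
            units.push_back(hex_byte(b) +
                            hex_byte(static_cast<unsigned char>(s[i + 1])));
            i += 2;
        } else {
            units.push_back(hex_byte(b));
            i += 1;
        }
    }
    return units;
}

// UTF-8 encoding of the two characters U+6D4B U+8BD5 used in the report,
// written with hex escapes so the source file encoding does not matter.
const std::string kUtf8Sample = "\xE6\xB5\x8B\xE8\xAF\x95";
```

Calling `split_double_byte(kUtf8Sample)` groups the six UTF-8 bytes into three two-byte units, which matches the three garbled characters in the error message: two 3-byte UTF-8 characters are read as three 2-byte characters.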

There was an earlier discussion related to this:

[Rcpp-devel] Unicode on windows 1

[Rcpp-devel] Unicode on windows 2

The solutions in the mailing list posts above do not fix the strings in exception messages.

I sent an email to the Rcpp mailing list about this issue; here are the links to the discussion:

[Rcpp-devel] Rcpp exception with UTF-8 strings on Windows 1

[Rcpp-devel] Rcpp exception with UTF-8 strings on Windows 2

It seems that Rcpp will not fix this any time soon, so I suggest rewriting the existing code with the original R C API.

Just take a look at the way I handle UTF-8 string input in stringi. It's pretty simple.
I suggest you add LinkingTo: stringi, call stri_enc_toutf8 on a given SEXP object, and then work with STRING_ELT etc. on the resulting SEXP.

qinwf commented

Yes, I imported stringi and all of the input strings are processed by stri_enc_toutf8.

qinwf commented

I opened a PR in the Rcpp repo that makes this fixable with a macro, and it was merged.

That's great that you contributed some code to Rcpp! Good job!

Now when you use the new macro this issue is fixed, right?

So I would say we can keep the Rcpp interface, right? (We don't need to consider rewriting the re2r interface to use the standard Rinternals.h headers.)