axiak/pyre2

Unicode differences between re2 and re?

turian opened this issue · 3 comments

I am seeing differences between re2 and re when re.UNICODE is used.

I am not able to get re2 to detect Unicode alphabetic characters, even when I encode to UTF-8.

Here is an example:

In [24]: print u'\xe8'.encode("utf-8")
è

In [25]: re.compile('[^\W]', re.UNICODE).search(u'\xe8')
Out[25]: <_sre.SRE_Match object at 0x1186850>

In [26]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8')

In [27]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8'.encode("utf-8"))
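
For readers without re2 installed, the same contrast can be reproduced with the standard library alone. This is only a stand-in, assuming Python 3: `re.ASCII` restricts `\w` and `\W` to ASCII, which mirrors RE2's documented default of `\w` meaning `[0-9A-Za-z_]`. It does not exercise pyre2 itself.

```python
import re

ch = "\u00e8"  # è, LATIN SMALL LETTER E WITH GRAVE

# Unicode-aware matching: è counts as a word character, so [^\W] matches it.
unicode_match = re.compile(r"[^\W]", re.UNICODE).search(ch)

# ASCII-only matching (stand-in for RE2's default \w): è is treated as
# a non-word character, so [^\W] finds nothing.
ascii_match = re.compile(r"[^\W]", re.ASCII).search(ch)

print(unicode_match is not None)  # True
print(ascii_match is not None)    # False
```

The second result is what the re2 calls above return, which suggests the UNICODE flag is not changing how `\W` is interpreted there.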

This is a glaring omission in prepare_pattern: we only handle \d, \w and \s, but not the corresponding \D, \W and \S. I'll try to find some time to fix it.
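
A rough sketch of what handling both the lowercase and uppercase escapes could look like. This is not pyre2's actual prepare_pattern: the function name `expand_unicode_classes`, the mapping table, and the choice of `\p{...}` property classes are all assumptions made for illustration, and the expansions only approximate Python's Unicode-aware classes.

```python
# Hypothetical mapping (not taken from pyre2): approximate Unicode-aware
# bodies for \d, \w and \s, written with \p{...} classes that RE2 accepts.
_UNICODE_CLASSES = {
    "d": r"\p{Nd}",
    "w": r"_\p{L}\p{N}",
    "s": r"\s\p{Z}",
}


def expand_unicode_classes(pattern):
    """Rewrite \\d \\w \\s and \\D \\W \\S into explicit Unicode classes.

    Simplifications: a literal ']' placed first inside brackets is not
    recognised, and uppercase escapes inside brackets are left untouched
    because a negated union cannot be inlined into an enclosing class.
    """
    out = []
    i = 0
    in_class = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            body = _UNICODE_CLASSES.get(nxt.lower())
            if body is None:
                out.append(ch + nxt)  # unrelated escape, copy verbatim
            elif in_class:
                # Inside [...]: inline the lowercase form, keep the
                # uppercase form as-is (not handled by this sketch).
                out.append(body if nxt.islower() else ch + nxt)
            else:
                # Outside [...]: lowercase becomes [...], uppercase [^...].
                out.append(("[" if nxt.islower() else "[^") + body + "]")
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)


print(expand_unicode_classes(r"\w+"))
print(expand_unicode_classes(r"\W"))
print(expand_unicode_classes(r"[^\W]"))
```

Note that the pattern from the report, `[^\W]`, puts the negated escape inside brackets, which this sketch deliberately leaves unchanged. A real fix would need a strategy for that case as well.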

Please.

We had an issue with \W, \D and \S that itsadok just fixed, and I have pushed the fix. However, I think there are still Unicode issues, since the groups in issue #4 don't match up quite right (I added that case as a test). Please pull the latest version and see if it works for you while I look into why the test is failing.