Unicode differences between re2 and re?
turian opened this issue · 3 comments
turian commented
I am seeing differences between re2 and re when re.UNICODE is used.
I am not able to get re2 to detect Unicode alphabetic characters, even when I encode to UTF-8.
Here is an example:
In [24]: print u'\xe8'.encode("utf-8")
è
In [25]: re.compile('[^\W]', re.UNICODE).search(u'\xe8')
Out[25]: <_sre.SRE_Match object at 0x1186850>
In [26]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8')
In [27]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8'.encode("utf-8"))
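For readers on Python 3, where Unicode-aware classes are the default for str patterns, the same symptom can be reproduced with the standard library alone. This is only a sketch and does not call re2: re.ASCII is used as a stand-in for an engine whose \W is ASCII-only, which is an assumption about the cause rather than something confirmed in this report.

```python
import re

ch = "\xe8"  # è

# Unicode-aware classes (the default for str patterns in Python 3).
unicode_match = re.compile(r"[^\W]").search(ch)

# ASCII-only classes, standing in for an engine whose \W is not
# Unicode-aware (an assumption, not re2 itself).
ascii_match = re.compile(r"[^\W]", re.ASCII).search(ch)

print(unicode_match is not None)  # True
print(ascii_match is not None)    # False
```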
itsadok commented
This is a glaring omission in prepare_pattern: we only handle \d, \w and \s, but not the corresponding \D, \W and \S. I'll try to find some time to fix it.
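One way to catch this kind of omission is to check that each negated class is the exact complement of its positive counterpart over a few sample characters. The sketch below is illustrative only: complement_failures, SAMPLES, PAIRS and half_translated are hypothetical names, not part of re2. half_translated imitates, with standard-library flags, a preparer that makes \d, \w and \s Unicode-aware but leaves \D, \W and \S ASCII-only.

```python
import re

# Sample characters: ASCII and non-ASCII letters, digits, whitespace, punctuation.
SAMPLES = ["a", "\xe8", "5", "\u0663", " ", "\u2003", "_", "-"]

# Each positive shorthand class paired with its negation.
PAIRS = [(r"\d", r"\D"), (r"\w", r"\W"), (r"\s", r"\S")]


def complement_failures(compile_fn=re.compile):
    """Return (class, negated class, char) triples where a character is
    matched by both classes or by neither."""
    failures = []
    for pos, neg in PAIRS:
        pos_re, neg_re = compile_fn(pos), compile_fn(neg)
        for ch in SAMPLES:
            if bool(pos_re.search(ch)) == bool(neg_re.search(ch)):
                failures.append((pos, neg, ch))
    return failures


def half_translated(pattern):
    # Hypothetical stand-in for the omission: lowercase classes stay
    # Unicode-aware, uppercase (negated) classes fall back to ASCII.
    flags = re.ASCII if pattern[1].isupper() else 0
    return re.compile(pattern, flags)


print(complement_failures())                 # consistent engine: no failures
print(complement_failures(half_translated))  # includes ('\\w', '\\W', 'è')
```

Passing a different compile function (for example one that wraps re2.compile with the Unicode flag) would run the same check against another engine.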
turian commented
Please.
axiak commented
We had an issue with \W, \D and \S that itsadok just fixed, and I have pushed the fix out. However, I think there are still Unicode issues: the groups in issue #4 don't match up quite right (I added that case as a test). Please pull the latest version and see if it works for you while I try to work out why the test is failing.