Forever-Young/mrab-regex-hg

unicode properties containing whitespace; unknown properties

GoogleCodeExporter opened this issue · 4 comments

Hi,
I just encountered possible errors in the handling of Unicode properties in 
regex via \p{...}.
I am using regex-2014.11.13 with Python 3.4.2 and 2.7 (both 32-bit) on 
Windows 7 (Czech).
It seems that Unicode properties containing whitespace, e.g. character names, 
are not recognised correctly.
Furthermore, properties deemed invalid are not handled as expected. I believe 
some earlier regex version raised a specific "invalid property" error, but it 
is not being raised correctly now; cf. the session below:

Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:15:05) [MSC v.1600 32 bit 
(Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import regex
>>> regex.findall(r"(?V1)\p{SPACE}", " 2 ") # OK
[' ', ' ']
>>> regex.findall(r"(?V1)\p{DIGIT TWO}", " 2 ")
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    regex.findall(r"(?V1)\p{DIGIT TWO}", " 2 ")
  File "...\Python34\lib\regex.py", line 318, in findall
    return _compile(pattern, flags, kwargs).findall(string, pos, endpos,
  File "...\Python34\lib\regex.py", line 489, in _compile
    parsed = _parse_pattern(source, info)
  File "...\Python34\lib\_regex_core.py", line 342, in _parse_pattern
    branches = [parse_sequence(source, info)]
  File "...\Python34\lib\_regex_core.py", line 357, in parse_sequence
    info)
  File "...\Python34\lib\_regex_core.py", line 684, in parse_literal_and_element
    element = parse_escape(source, info, False)
  File "...\Python34\lib\_regex_core.py", line 1107, in parse_escape
    return parse_property(source, info, ch == "p", in_set)
  File "...\Python34\lib\_regex_core.py", line 1246, in parse_property
    prop = lookup_property(prop_name, name, positive != negate, source_pos=source.pos)
  File "...\Python34\lib\_regex_core.py", line 1545, in lookup_property
    raise error("unknown property", source.string, source_pos)
NameError: name 'source' is not defined
>>> regex.findall(r"(?V1)\p{NOSUCHPROPERTY}", " 2 ")
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    regex.findall(r"(?V1)\p{NOSUCHPROPERTY}", " 2 ")
... [the same traceback follows - ditto]

The same errors are triggered in py 2.7.

Regards,
    vbr

Original issue reported on code.google.com by Vlastimil.Brom@gmail.com on 13 Nov 2014 at 12:35

As far as I can tell, it does correctly handle spaces in the property names:

>>> regex.findall(r"(?V1)\p{WHITESPACE}", " 2 ")
[' ', ' ']
>>> regex.findall(r"(?V1)\p{WHITE SPACE}", " 2 ")
[' ', ' ']

As for "DIGIT TWO", there's no such property. There is, however, a codepoint 
with that name:

>>> regex.findall(r"(?V1)\N{DIGIT TWO}", " 2 ")
['2']

But the traceback does reveal a bug. :-(
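
For readers following along, the failure mode visible in the traceback can be sketched with a toy example. This is not the library's code: `PatternError`, `KNOWN`, `lookup_buggy` and `lookup_fixed` are hypothetical names, and the actual fix in regex may look different. It only shows why a NameError surfaces instead of the intended "unknown property" error: the raise statement refers to a name that does not exist in that scope.

```python
class PatternError(Exception):
    """Hypothetical stand-in for the library's own error type."""


# Toy property table, for illustration only.
KNOWN = {"SPACE", "WHITESPACE"}


def lookup_buggy(name, source_pos):
    if name in KNOWN:
        return name
    # 'source' is neither a parameter nor a global here, so evaluating
    # source.string raises NameError before PatternError is ever built.
    raise PatternError("unknown property", source.string, source_pos)


def lookup_fixed(name, source_string, source_pos):
    if name in KNOWN:
        return name
    # Everything the error message needs is passed in explicitly.
    raise PatternError("unknown property", source_string, source_pos)


for func, args in [(lookup_buggy, ("DIGIT TWO", 16)),
                   (lookup_fixed, ("DIGIT TWO", "pattern text", 16))]:
    try:
        func(*args)
    except Exception as exc:
        print(func.__name__, "->", type(exc).__name__)
```

The point of the sketch is that an error path which is never exercised can hide a second bug that masks the intended error.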

Original comment by re...@mrabarnett.plus.com on 13 Nov 2014 at 2:14

  • Changed state: Started
Thanks for the clarification. I had mistakenly assumed that properties and 
character names are handled together. (I forgot about the \N{...} literal in 
regex and, to add to the confusion, SPACE, a simple character name without 
whitespace, also appears to be understood as a Unicode property.)

vbr

Original comment by Vlastimil.Brom@gmail.com on 13 Nov 2014 at 5:48

There's a codepoint called "SPACE" (U+0020) and a property called "Space", 
which is an alias for "White_Space" ("WSpace" is another alias).

You can see the difference between the named codepoint and the property here:

>>> regex.findall(r'\N{Space}', ' \n')
[' ']
>>> regex.findall(r'\p{Space}', ' \n')
[' ', '\n']

Having a property called "space" is a long-standing convention; even Python has 
it:

>>> '\n'.isspace() # Should really be called 'iswhitespace'!
True
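
The same distinction can be reproduced with the standard library alone, as a rough sketch. It assumes that `str.isspace()` is an acceptable stand-in for the White_Space property; the two sets are close but not identical, so this is an approximation rather than what regex does internally.

```python
import unicodedata

text = " \n"

# A named code point denotes exactly one character (U+0020 here).
space = unicodedata.lookup("SPACE")
named_matches = [ch for ch in text if ch == space]

# A property denotes a whole class of characters. str.isspace() is used
# here as a rough stand-in for White_Space; the two sets are not identical.
property_matches = [ch for ch in text if ch.isspace()]

print(named_matches)     # [' ']
print(property_matches)  # [' ', '\n']
```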

Original comment by re...@mrabarnett.plus.com on 13 Nov 2014 at 6:22

Fixed in regex 2014.11.14.

Original comment by re...@mrabarnett.plus.com on 14 Nov 2014 at 12:21

  • Changed state: Fixed