unicode properties containing whitespace; unknown properties
GoogleCodeExporter opened this issue · 4 comments
GoogleCodeExporter commented
Hi,
I just encountered possible errors in the handling of Unicode properties in regex via
\p{...}
I am using regex-2014.11.13 with Python 3.4.2 and 2.7 (both 32-bit) on Win 7
(Czech).
It seems that Unicode properties containing whitespace are not recognised
correctly - e.g. the character names.
Furthermore, it seems that properties deemed invalid are not handled as
expected (I believe there was a specific "invalid property" error in some
earlier regex version, but it is not being raised correctly now); cf.:
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 6 2014, 22:15:05) [MSC v.1600 32 bit
(Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import regex
>>> regex.findall(r"(?V1)\p{SPACE}", " 2 ") # OK
[' ', ' ']
>>> regex.findall(r"(?V1)\p{DIGIT TWO}", " 2 ")
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
regex.findall(r"(?V1)\p{DIGIT TWO}", " 2 ")
File "...\Python34\lib\regex.py", line 318, in findall
return _compile(pattern, flags, kwargs).findall(string, pos, endpos,
File "...\Python34\lib\regex.py", line 489, in _compile
parsed = _parse_pattern(source, info)
File "...\Python34\lib\_regex_core.py", line 342, in _parse_pattern
branches = [parse_sequence(source, info)]
File "...\Python34\lib\_regex_core.py", line 357, in parse_sequence
info)
File "...\Python34\lib\_regex_core.py", line 684, in parse_literal_and_element
element = parse_escape(source, info, False)
File "...\Python34\lib\_regex_core.py", line 1107, in parse_escape
return parse_property(source, info, ch == "p", in_set)
File "...\Python34\lib\_regex_core.py", line 1246, in parse_property
prop = lookup_property(prop_name, name, positive != negate, source_pos=source.pos)
File "...\Python34\lib\_regex_core.py", line 1545, in lookup_property
raise error("unknown property", source.string, source_pos)
NameError: name 'source' is not defined
>>> regex.findall(r"(?V1)\p{NOSUCHPROPERTY}", " 2 ")
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
regex.findall(r"(?V1)\p{NOSUCHPROPERTY}", " 2 ")
... [the same traceback follows - ditto]
The same errors are triggered in py 2.7.
Regards,
vbr
Original issue reported on code.google.com by Vlastimil.Brom@gmail.com
on 13 Nov 2014 at 12:35
GoogleCodeExporter commented
As far as I can tell, it does correctly handle spaces in the property names:
>>> regex.findall(r"(?V1)\p{WHITESPACE}", " 2 ")
[' ', ' ']
>>> regex.findall(r"(?V1)\p{WHITE SPACE}", " 2 ")
[' ', ' ']
As for "DIGIT TWO", there's no such property. There is, however, a codepoint
with that name:
>>> regex.findall(r"(?V1)\N{DIGIT TWO}", " 2 ")
['2']
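The difference between a character name and a property can also be checked with the standard library alone. This is a minimal sketch, not part of the original exchange; it assumes only Python's built-in unicodedata module, and the variable names are illustrative.

```python
import unicodedata

# A character name identifies exactly one codepoint.
two = unicodedata.lookup("DIGIT TWO")
print(repr(two), hex(ord(two)))

# A property value, such as the general category, is shared by many codepoints.
arabic_two = unicodedata.lookup("ARABIC-INDIC DIGIT TWO")
print(unicodedata.category(two), unicodedata.category(arabic_two))

# Both digits carry the same category, but each has its own unique name.
print(unicodedata.name(two), "/", unicodedata.name(arabic_two))
```

So a name such as "DIGIT TWO" belongs with \N{...}, while \p{...} expects a property or property value that a whole set of characters can share.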
But the traceback does reveal a bug. :-(
Original comment by re...@mrabarnett.plus.com
on 13 Nov 2014 at 2:14
- Changed state: Started
GoogleCodeExporter commented
Thanks for the clarification - I was mistaken in assuming that properties and
character names are somehow treated together. (I forgot about the \N{...} escape
in regex and, to add to the confusion, SPACE, a simple character name without
whitespace, also appears to be understood as a Unicode property.)
vbr
Original comment by Vlastimil.Brom@gmail.com
on 13 Nov 2014 at 5:48
GoogleCodeExporter commented
There's a codepoint called "SPACE" (U+0020) and a property called "Space",
which is an alias for "White_Space" ("WSpace" is another alias).
You can see the difference between the named codepoint and the property here:
>>> regex.findall(r'\N{Space}', ' \n')
[' ']
>>> regex.findall(r'\p{Space}', ' \n')
[' ', '\n']
Having a property called "space" is a long-standing convention; even Python has
it:
>>> '\n'.isspace() # Should really be called 'iswhitespace'!
True
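The same contrast can be sketched without the regex module. This is only an illustration under stated assumptions: str.isspace() stands in for the "Space" property here, although it is an approximation and not an exact match for Unicode White_Space, and the sample string and variable names are made up for the example.

```python
import unicodedata

# The named codepoint SPACE is a single character, U+0020.
space = unicodedata.lookup("SPACE")

# Sample text: a space, a newline, a tab, a no-break space and a letter.
sample = " \n\t\u00a0x"

# Matching by name selects only that one codepoint ...
named = [ch for ch in sample if ch == space]

# ... while a whitespace test selects a whole class of characters.
whitespace = [ch for ch in sample if ch.isspace()]

print(named)
print(whitespace)
```

The first list contains only the space itself, while the second also picks up the newline, the tab and the no-break space.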
Original comment by re...@mrabarnett.plus.com
on 13 Nov 2014 at 6:22
GoogleCodeExporter commented
Fixed in regex 2014.11.14.
Original comment by re...@mrabarnett.plus.com
on 14 Nov 2014 at 12:21
- Changed state: Fixed