DEBUG flag
Closed this issue · 5 comments
GoogleCodeExporter commented
Hi,
in the context of some tests for Issue 74, I noticed some problems with the
regex.DEBUG output on several patterns; cf.:
# ok
>>> regex.match(ur"a(b)[X](d) ", u"abXd ", regex.DEBUG)
CHARACTER MATCH 97
GROUP 1
CHARACTER MATCH 98
CHARACTER MATCH 88
GROUP 2
CHARACTER MATCH 100
CHARACTER MATCH 32
<_regex.Match object at 0x035E8640>
# multiple-character content of a character set cannot be displayed due to
formatting issues
>>> regex.match(ur"a(b)[Xx](d) ", u"abXd ", regex.DEBUG)
CHARACTER MATCH 97
GROUP 1
CHARACTER MATCH 98
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "regex.pyc", line 231, in match
File "regex.pyc", line 487, in _compile
File "_regex_core.pyc", line 3090, in dump
File "_regex_core.pyc", line 3176, in dump
TypeError: not enough arguments for format string
>>>
# both patterns match without DEBUG
>>> regex.match(ur"a(b)[X](d) ", u"abXd ")
<_regex.Match object at 0x035E8678>
>>> regex.match(ur"a(b)[Xx](d) ", u"abXd ")
<_regex.Match object at 0x035E86B0>
>>>
# finally, the DEBUG flag for the previously working pattern seems to be
ignored afterwards (possibly some internal state isn't set?) and only the match
is returned without further information.
>>> regex.match(ur"a(b)[X](d) ", u"abXd ", regex.DEBUG)
<_regex.Match object at 0x035E8640>
>>>
(The bug-inducing pattern is consistent on further trials with DEBUG and throws
the same exception.)
(regex-0.1.20120708, py 2.7.2, win XPh)
regards,
vbr
Original issue reported on code.google.com by Vlastimil.Brom@gmail.com
on 9 Jul 2012 at 10:02
GoogleCodeExporter commented
Fixed in regex 0.1.20120709.
I've also made the debug output a little more readable by showing string
literals and property names/values.
Original comment by re...@mrabarnett.plus.com
on 9 Jul 2012 at 7:07
- Changed state: Fixed
GoogleCodeExporter commented
Thanks for the fix and the enhanced debug output, it is indeed much more
readable and informative.
Can I read the meaning of the codes somewhere in the source or somewhere else?
e.g. "S", like in:
>>> regex.match("(?:abc)", "abc", regex.DEBUG)
S 'abc'
<_regex.Match object at 0x03625950>
(maybe substring?/subpattern?)
I guess the debug output cannot be obtained as a string variable or container;
is temporarily replacing sys.stdout the expected way to get this output?
Is it specified which information is output for debug and which is not?
Would it be (in principle) possible to build a functionally equivalent pattern
based on the debug output?
(Of course, the exact pattern isn't guaranteed, as there might be synonyms like
the quantifiers ? and {0,1}, or any number of irrelevant parts like bare
non-capturing parens etc.)
---
Anyway, I noticed a possible problem with repeated calls using the same pattern.
Could it be that the debug flag is somehow ignored in subsequent runs using
the same pattern? Maybe due to some internal caching?
>>> regex.match(r"AB?", "AB", regex.DEBUG)
CHARACTER MATCH 'A'
GREEDY_REPEAT 0 1
CHARACTER MATCH 'B'
<_regex.Match object at 0x036256E8>
>>> regex.match(r"AB", "AB", regex.DEBUG)
S 'AB'
<_regex.Match object at 0x03625950>
>>> regex.match(r"A", "A", regex.DEBUG)
CHARACTER MATCH 'A'
<_regex.Match object at 0x036256E8>
>>>
>>>
>>>
>>> # repeated calls of the same patterns discard the debug info
>>>
>>> regex.match(r"AB", "AB", regex.DEBUG)
<_regex.Match object at 0x03625950>
>>> regex.match(r"A", "A", regex.DEBUG)
<_regex.Match object at 0x036256E8>
>>> regex.match(r"AB?", "AB", regex.DEBUG)
<_regex.Match object at 0x03625800>
>>>
(regex 0.1.20120709, Python 2.7.2, Win XP)
regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 10 Jul 2012 at 3:02
GoogleCodeExporter commented
The "S" is a bug in the Python 2 version. :-(
The debug output is generated only when the pattern is compiled. Patterns are
cached, so the debug output won't be generated again for that pattern (with
that set of flags) because it won't be compiled again, just looked up. You
still have the ability to purge the cache. Incidentally, the re module does the
same!
This generating of debug output is based on the re module. When the re module
was written, I don't think the author expected that anyone would want to
capture it.
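For what it's worth, temporarily redirecting standard output is one way to get the debug text as a string. The sketch below makes a few assumptions: it uses the standard re module on Python 3 (not regex on Python 2.7 as in this thread), it assumes the debug output is written to sys.stdout via print, and `capture_debug` is a hypothetical helper name. The cache is purged first so that the pattern is really compiled.

```python
import io
import re
from contextlib import redirect_stdout


def capture_debug(pattern, flags=0):
    """Compile `pattern` with re.DEBUG and return the debug text as a string."""
    re.purge()  # force a real compile in case the pattern is already cached
    buffer = io.StringIO()
    # Anything printed to sys.stdout inside this block lands in `buffer`.
    with redirect_stdout(buffer):
        re.compile(pattern, flags | re.DEBUG)
    return buffer.getvalue()


debug_text = capture_debug(r"AB?")
print(debug_text)
```

The same approach should carry over to regex.compile and regex.purge if the module prints its debug output the same way, but that has not been checked here.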
I believe that the only information missing from the debug output is whether
it's for ASCII, LOCALE or UNICODE.
Original comment by re...@mrabarnett.plus.com
on 10 Jul 2012 at 3:49
GoogleCodeExporter commented
Fixed in regex 0.1.20120710.
Original comment by re...@mrabarnett.plus.com
on 10 Jul 2012 at 4:48
GoogleCodeExporter commented
Thanks for the fix.
I already found the opcodes in _regex_core.py and could try to infer the
correspondence to the pattern parts and see how far I can go either this way
or using direct regex matching on the pattern string.
Thanks,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 11 Jul 2012 at 10:05