Forever-Young/mrab-regex-hg

DEBUG flag

Closed this issue · 5 comments

Hi,
while running some tests for Issue 74, I noticed some problems with the 
regex.DEBUG output for several patterns; cf.:

# ok
>>> regex.match(ur"a(b)[X](d) ", u"abXd ", regex.DEBUG)
CHARACTER MATCH 97
GROUP 1
  CHARACTER MATCH 98
CHARACTER MATCH 88
GROUP 2
  CHARACTER MATCH 100
CHARACTER MATCH 32
<_regex.Match object at 0x035E8640>

# a character set with more than one member cannot be displayed; the dump 
# fails with a string-formatting error
>>> regex.match(ur"a(b)[Xx](d) ", u"abXd ", regex.DEBUG)
CHARACTER MATCH 97
GROUP 1
  CHARACTER MATCH 98
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 231, in match
  File "regex.pyc", line 487, in _compile
  File "_regex_core.pyc", line 3090, in dump
  File "_regex_core.pyc", line 3176, in dump
TypeError: not enough arguments for format string
>>> 

# both patterns match without DEBUG
>>> regex.match(ur"a(b)[X](d) ", u"abXd ")
<_regex.Match object at 0x035E8678>
>>> regex.match(ur"a(b)[Xx](d) ", u"abXd ")
<_regex.Match object at 0x035E86B0>
>>> 

# finally, the DEBUG flag for the previously working pattern seems to be 
# ignored afterwards (possibly some internal state isn't set?) and only the 
# match is returned without further information.

>>> regex.match(ur"a(b)[X](d) ", u"abXd ", regex.DEBUG)
<_regex.Match object at 0x035E8640>
>>> 

(The bug-inducing pattern is consistent on further trials with DEBUG and throws 
the same exception.)

(regex-0.1.20120708, py 2.7.2, win XPh)

regards,
   vbr

Original issue reported on code.google.com by Vlastimil.Brom@gmail.com on 9 Jul 2012 at 10:02

Fixed in regex 0.1.20120709.

I've also made the debug output a little more readable by showing string 
literals and property names/values.

Original comment by re...@mrabarnett.plus.com on 9 Jul 2012 at 7:07

  • Changed state: Fixed
Thanks for the fix and the enhanced debug output; it is indeed much more 
readable and informative.
Is the meaning of the codes documented anywhere, in the source or elsewhere?
For example "S", as in:
>>> regex.match("(?:abc)", "abc", regex.DEBUG)
S 'abc'
<_regex.Match object at 0x03625950>
(maybe substring?/subpattern?)

I guess the debug output cannot be obtained as a string variable or container; 
is temporarily replacing sys.stdout the expected way to get this output?
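
A minimal sketch of the sys.stdout approach, assuming Python 3 and the standard-library re module rather than regex (the assumption is that the dump is written to sys.stdout while the pattern is compiled; `capture_debug` is a hypothetical helper name, not part of either module):

```python
import contextlib
import io
import re


def capture_debug(pattern, flags=0):
    """Compile `pattern` with DEBUG set and return the dump as a string."""
    buffer = io.StringIO()
    # The dump is printed while the pattern is compiled, so only the
    # compile call needs to run with sys.stdout redirected.
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, flags | re.DEBUG)
    return buffer.getvalue()


dump = capture_debug(r"AB?")
print(dump)
```

If the pattern is served from a cache instead of being compiled, nothing is printed and the returned string may be empty.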

Is it specified which information is included in the debug output and which is 
not? Would it be possible (in principle) to build a functionally equivalent 
pattern from the debug output?

(Of course, the exact original pattern cannot be recovered, as there are 
synonyms such as the quantifiers ? and {0,1}, and any number of irrelevant 
parts such as bare non-capturing parentheses.)

---

Anyway, I noticed a possible problem with repeated calls using the same pattern.

Could it be that the debug flag is somehow ignored in subsequent runs using 
the same pattern? Maybe due to some internal caching?

>>> regex.match(r"AB?", "AB", regex.DEBUG)
CHARACTER MATCH 'A'
GREEDY_REPEAT 0 1
  CHARACTER MATCH 'B'
<_regex.Match object at 0x036256E8>
>>> regex.match(r"AB", "AB", regex.DEBUG)
S 'AB'
<_regex.Match object at 0x03625950>
>>> regex.match(r"A", "A", regex.DEBUG)
CHARACTER MATCH 'A'
<_regex.Match object at 0x036256E8>
>>> 
>>> 
>>> 
>>> # repeated calls of the same patterns discard the debug info
>>> 
>>> regex.match(r"AB", "AB", regex.DEBUG)
<_regex.Match object at 0x03625950>
>>> regex.match(r"A", "A", regex.DEBUG)
<_regex.Match object at 0x036256E8>
>>> regex.match(r"AB?", "AB", regex.DEBUG)
<_regex.Match object at 0x03625800>
>>>

(regex 0.1.20120709, Python 2.7.2, Win XP)

regards,
  vbr

Original comment by Vlastimil.Brom@gmail.com on 10 Jul 2012 at 3:02

The "S" is a bug in the Python 2 version. :-(

The debug output is generated only when the pattern is compiled. Patterns are 
cached, so the debug output won't be generated again for that pattern (with 
that set of flags) because it won't be compiled again, just looked up. You 
still have the ability to purge the cache. Incidentally, the re module does the 
same!
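
A rough sketch of purging the cache before recompiling, written against the standard-library re module under Python 3. It assumes regex exposes an equivalent purge() call, and `fresh_debug_dump` is a hypothetical helper name:

```python
import contextlib
import io
import re


def fresh_debug_dump(pattern, flags=0):
    """Return the DEBUG dump for `pattern`, forcing a recompile."""
    # Empty the module-level pattern cache so the next compile call
    # cannot be answered by a lookup.
    re.purge()
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, flags | re.DEBUG)
    return buffer.getvalue()


# Because the cache is purged each time, both calls compile the pattern
# and both produce a dump.
first = fresh_debug_dump(r"AB")
second = fresh_debug_dump(r"AB")
```

Purging discards every cached pattern, not just the one of interest, so this is better suited to interactive debugging than to hot code paths.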

The way the debug output is generated is based on the re module. When the re 
module was written, I don't think the author expected that anyone would want to 
capture it.

I believe that the only information missing from the debug output is whether 
it's for ASCII, LOCALE or UNICODE.
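
That missing piece can usually be read from the compiled pattern itself. The sketch below assumes Python 3 and the standard-library re module, whose pattern objects carry their effective flags in a `flags` attribute; `encoding_mode` is a hypothetical helper name, and whether regex reports its flags the same way is an assumption:

```python
import re


def encoding_mode(compiled):
    """Return "ASCII", "LOCALE" or "UNICODE" for a compiled pattern,
    or None if none of those flags is set."""
    for name in ("ASCII", "LOCALE", "UNICODE"):
        if compiled.flags & getattr(re, name):
            return name
    return None


print(encoding_mode(re.compile(r"\w+")))
print(encoding_mode(re.compile(r"\w+", re.ASCII)))
```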

Original comment by re...@mrabarnett.plus.com on 10 Jul 2012 at 3:49

Fixed in regex 0.1.20120710.

Original comment by re...@mrabarnett.plus.com on 10 Jul 2012 at 4:48

Thanks for the fix.
I have already found the opcodes in _regex_core.py, so I can try to infer how 
they correspond to the parts of a pattern and see how far I get, either this 
way or by matching directly on the pattern string with a regex.

Thanks,
 vbr

Original comment by Vlastimil.Brom@gmail.com on 11 Jul 2012 at 10:05
