
Python module to generate all regular expression matches

Quick Start

The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or to count the possible matches. It works from the parsed regular expression, so you get a much more accurate result than by trying to split the pattern string yourself.

>>> s = 'foo|ba[rz]'
>>> s.split('|')  # bad
['foo', 'ba[rz]']

>>> import sre_yield
>>> list(sre_yield.AllStrings(s))  # better
['foo', 'bar', 'baz']

It does this by walking the tree as constructed by sre_parse (the same parser the re module uses internally) and constructing chained/repeating iterators as appropriate. Depending on the input pattern, the results may contain duplicates -- these are cases that sre_parse did not optimize.
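To make the tree walk concrete, here is a minimal sketch that expands a parsed pattern using only the standard library. It is not sre_yield's implementation: the `expand` function is a hypothetical helper, it handles only literals, simple character sets and alternation, and it builds lists eagerly instead of chaining lazy iterators. It also relies on CPython's parser module, which is private in recent Python versions.

```python
import itertools

try:  # Python 3.11+ keeps the parser in a private submodule
    import re._parser as sre_parse
except ImportError:  # older versions expose it as a top-level module
    import sre_parse


def expand(nodes):
    """Yield every string matched by a parsed (sub)pattern.

    Hypothetical helper: only LITERAL, IN (of literals) and BRANCH nodes
    are handled; anything else raises NotImplementedError.
    """
    choices = []  # one list of possible strings per node, in pattern order
    for op, arg in nodes:
        if op == sre_parse.LITERAL:
            choices.append([chr(arg)])
        elif op == sre_parse.IN:
            members = []
            for member_op, member_arg in arg:
                if member_op != sre_parse.LITERAL:
                    raise NotImplementedError(str(member_op))
                members.append(chr(member_arg))
            choices.append(members)
        elif op == sre_parse.BRANCH:
            _, alternatives = arg
            # Each alternative is itself a sequence of nodes: recurse.
            choices.append([s for alt in alternatives for s in expand(alt)])
        else:
            raise NotImplementedError(str(op))
    # Concatenate one choice per node, in every combination.
    for parts in itertools.product(*choices):
        yield "".join(parts)


print(list(expand(sre_parse.parse('foo|ba[rz]'))))
```

Because the walk simply follows whatever tree the parser produced, any redundancy the parser leaves in place (such as two identical alternatives) shows up as duplicate output.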

>>> import sre_yield
>>> list(sre_yield.AllStrings('.|a', charset='ab'))
['a', 'b', 'a']

...and happens in simpler cases too:

>>> list(sre_yield.AllStrings('a|a'))
['a', 'a']
>>> list(sre_yield.AllStrings('[aa]'))
['a', 'a']


The membership check, 'abc' in values_obj, is by necessity a fullmatch -- it must cover the entire string. Imagine that the pattern has ^(...)$ around it. Because re.search can match anywhere in an arbitrary string, emulating it would produce a large number of junk matches -- probably not what you want. (If that is what you want, add a .* on either side.)
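The difference between the two behaviours can be reproduced with the standard re module alone. The sketch below only illustrates the semantics; `contains` is a hypothetical helper, not part of sre_yield, and nothing here says how sre_yield implements its membership check.

```python
import re

s = 'foo|ba[rz]'


def contains(pattern, candidate):
    # Hypothetical helper: fullmatch requires the whole candidate to match,
    # as if the pattern were wrapped in ^(...)$.
    return re.fullmatch(pattern, candidate) is not None


print(contains(s, 'bar'))                      # True: the whole string matches
print(contains(s, 'rebar'))                    # False: only a substring matches
print(re.search(s, 'rebar') is not None)       # True: search finds that substring
print(contains('.*(?:' + s + ').*', 'rebar'))  # True: .* on either side emulates search
```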

Here's a quick example, using the presidents regex from http://xkcd.com/1313/

>>> s = 'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls'

>>> import re
>>> re.search(s, 'kennedy') is not None  # note .search
True
>>> v = sre_yield.AllStrings(s)
>>> v.__len__()
23
>>> 'bu' in v
True
>>> v[:5]
['bu', 'rt', 'nt', 'ce', 'oe']

If you do want to emulate search, you end up with a large number of matches quickly. Limiting the repetition a bit helps, but it's still a very large number.

>>> v2 = sre_yield.AllStrings('.{,30}(' + s + ').{,30}')
>>> v2.__len__()  # too big for int
>>> 'kennedy' in v2
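The size of that number can be estimated by hand, without generating anything. The sketch below multiplies the choices for each part of the pattern. It assumes that '.' can stand for 255 different characters (every 8-bit character except newline); that figure is an assumption about sre_yield's default character set, so treat the result as an estimate of the magnitude, not as the library's exact output.

```python
import sys

# One term per '|' branch of the presidents regex: 'bu' -> 1, '[rn]t' -> 2, ...
alternatives = 1 + 2 + 3 + 3 + 1 + 1 + 2 + 2 + 1 + 1 + 3 + 2 + 1

# Assumption (not confirmed by the text above): '.' has 255 possible values.
dot_choices = 255

# '.{,30}' matches 0 to 30 characters, so add up the choices for each length.
padding = sum(dot_choices ** k for k in range(31))

# Leading padding, then one alternative, then trailing padding.
total = padding * alternatives * padding

print(alternatives)
print(total > sys.maxsize)
```

A count this large is above sys.maxsize, and the built-in len() raises OverflowError for such values, which is presumably why the example calls v2.__len__() directly.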

Capturing Groups

If you're interested in extracting what would match during generation of a value, you can use AllMatches instead to get Match objects.

>>> v = sre_yield.AllMatches(r'a(\d)b')
>>> m = v[0]
>>> m.group(0)
'a0b'
>>> m.group(1)
'0'

This even works for simplistic backreferences, in this case to have matching quotes.

>>> v = sre_yield.AllMatches(r'(["\'])([01]{3})\1')
>>> m = v[0]
>>> m.group(0)
'"000"'
>>> m.groups()
('"', '000')
>>> m.group(1)
'"'
>>> m.group(2)
'000'
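To see why the backreference matters when generating values, the following sketch enumerates the candidates for the quoted pattern by hand, using only the standard library. It is an illustration of the constraint, not sre_yield's algorithm, and the order of `candidates` is not claimed to match the order AllMatches uses.

```python
import itertools
import re

pattern = r'(["\'])([01]{3})\1'

candidates = []
for quote in '"\'':
    for bits in itertools.product('01', repeat=3):
        body = ''.join(bits)
        # \1 repeats whatever group 1 matched, so the closing quote is not
        # a free choice: it must equal the opening quote.
        candidates.append(quote + body + quote)

print(len(candidates))  # 2 quote characters * 8 bit strings
match = re.fullmatch(pattern, candidates[0])
print(match.groups())
```

Without the backreference constraint the closing quote would double the count; with it, every generated string has matching quotes.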

Reporting Bugs, etc.

We welcome bug reports -- see our issue tracker on GitHub to see if it's been reported before. If you'd like to discuss anything, we have a Google Group as well.

Differences between sre_yield and the re module

There are certainly valid regular expressions which sre_yield does not handle. These include lookarounds and backreferences beyond the simple kind shown under Capturing Groups, as well as a few other exceptions:

  • The maximum value for repeats is system-dependent -- in CPython's sre module there's a special value which is treated as infinite (either 2**16-1 or 2**32-1 depending on build). In sre_yield, this is taken as a literal count rather than infinite, thus (on a 2**16-1 platform):

    >>> len(sre_yield.AllStrings('a*')[-1])
    65535
    >>> import re
    >>> len(re.match('.*', 'a' * 100000).group(0))
    100000

  • The re module docs say "Regular expression pattern strings may not contain null bytes" yet this appears to work fine.

  • Order does not depend on greediness.

  • The regex is treated as fullmatch.

  • sre_yield is confused by complex uses of anchors, but supports simple ones:

    >>> list(sre_yield.AllStrings('foo$'))
    ['foo']
    >>> list(sre_yield.AllStrings('^$'))
    ['']
    >>> list(sre_yield.AllStrings('.\\b.'))
    Traceback (most recent call last):
    ParseError: Non-end-anchor None found at END state
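
The repeat limit described in the first item of the list above can be inspected directly. This sketch asks CPython's parser what upper bound it records for 'a*'. It relies on the parser module, which is private in recent Python versions, and the value printed depends on the interpreter build, so no particular number is assumed here.

```python
try:  # Python 3.11+ keeps the parser in a private submodule
    import re._parser as sre_parse
except ImportError:  # older versions expose it as a top-level module
    import sre_parse

# 'a*' parses to a single MAX_REPEAT node: (op, (min, max, subpattern)).
op, (low, high, _) = sre_parse.parse('a*')[0]

print(op == sre_parse.MAX_REPEAT)
print(low)
# The "infinite" upper bound is stored as the MAXREPEAT sentinel, which a
# generator that takes it literally would treat as a finite count.
print(high == sre_parse.MAXREPEAT)
print(int(high))
```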