/regex

Regular expression engine in Python using Thompson's algorithm.

Primary LanguagePythonMIT LicenseMIT

A regex engine in Python following Thompson's Algorithm. This will perform significantly better than the backtracking approach implemented in Python's re module on some pathological patterns.

I wrote a blog post about this project here

It has the same interface as Python's re module:

import regex
n = 20
p = 'a?' * n + 'a' * n
nfa = regex.compile(p)
input_string = 'a' * n
matched = nfa.match(input_string)
print(matched) # True

Currently it supports the following:

  • Repetition operators: * + ?
  • Parenthesis
  • Characters (no character sets)

regex.py is the interface, parse.py holds main implementation logic, sample.py shows some samples.

You can run python3 testing.py -v to ensure it passes all test cases in test_suite.dat

Performance

This regex engine underperforms Python's re module on normal inputs (using Glenn Fowler's test suite -- see below), however it outperforms significantly on pathological inputs.

normal pathological

Credits

  • Test suite is based on Glenn Fowler's regex test suites.

  • Russ Cox has an excellent collection of articles on regex.