mrabarnett/mrab-regex

Running time for failing fullmatch increases rapidly with input length

dlukes opened this issue ยท 5 comments

On Python 3.11.0 with regex 2023.3.22, the running times quickly become unmanageable for the following failing fullmatch with growing input:

import time
import regex as re

toks = "xxx x x xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\".split()
for i in range(-1, -len(toks) - 1, -1):
    start = time.perf_counter()
    substr = " ".join(toks[i:])
    print(f"Trying to match from token {i} against {substr!r}...", end="")
    re.fullmatch(r"(x+|\s+)*", " ".join(substr))
    print(f" finished in {time.perf_counter() - start:.3f}s.")

Here's some sample output:

Trying to match from token -1 against 'xxx\\'... finished in 0.000s.
Trying to match from token -2 against 'xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -3 against 'x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -4 against 'xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -5 against 'xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.001s.
Trying to match from token -6 against 'x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.003s.
Trying to match from token -7 against 'xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.014s.
Trying to match from token -8 against 'xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.030s.
Trying to match from token -9 against 'xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.117s.
Trying to match from token -10 against 'xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.488s.
Trying to match from token -11 against 'xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 1.898s.
Trying to match from token -12 against 'x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 7.597s.
^CTrying to match from token -13 against 'xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'...Traceback (most recent call last):
  File "/cnk/users/home/lukes/corp/argentina/rerror.py", line 9, in <module>
    re.fullmatch(r"(x+|\s+)*", " ".join(substr))
  File "/home/lukes/.local/mambaforge/envs/umrk/lib/python3.11/site-packages/regex/regex.py", line 261, in fullmatch
    return pat.fullmatch(string, pos, endpos, concurrent, partial, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

But maybe my regex is a known pathological case that I should simply avoid? The builtin re module also exhibits this kind of slowdown, although not as dramatic AFAICS (as in, the -12 variant ran in ~4.5s instead of ~7.5s).

It's a regression in the newest release.
Output with 2022.10.31 installed:

Trying to match from token -1 against 'xxx\\'... finished in 0.000s.
Trying to match from token -2 against 'xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -3 against 'x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -4 against 'xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -5 against 'xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -6 against 'x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -7 against 'xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -8 against 'xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -9 against 'xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -10 against 'xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -11 against 'xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -12 against 'x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -13 against 'xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -14 against 'xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -15 against 'xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -16 against 'x xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -17 against 'x x xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -18 against 'xxx x x xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.

It's due to a change to fix issue #494. I thought it better for it to be slower than wrong, and you can add a timeout if necessary. I don't know how easy it'll be to solve the problem; I'm still working on it.

Sure, that makes sense :) I thought it might be related. I'm indeed using timeout as a workaround. Thank you for the details!

A better solution is to use possessive repeats and atomic groups to explicitly reduce the chance of backtracking, e.g. r"(x+|\s+)*+" or r"(?>x+|\s+)*". The re module in recent versions of Python support them.

Failing that, use timeouts just in case.

Thank you very much for those tips, those are extremely useful! Always happy to learn more about lesser known regex features :)