Running time for failing fullmatch increases rapidly with input length

Question

Running time for failing fullmatch increases rapidly with input length

dlukes opened this issue 2 years ago · 5 comments

On Python 3.11.0 with regex 2023.3.22, the running times quickly become unmanageable for the following failing fullmatch with growing input:

import time
import regex as re

toks = "xxx x x xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\".split()
for i in range(-1, -len(toks) - 1, -1):
    start = time.perf_counter()
    substr = " ".join(toks[i:])
    print(f"Trying to match from token {i} against {substr!r}...", end="")
    re.fullmatch(r"(x+|\s+)*", " ".join(substr))
    print(f" finished in {time.perf_counter() - start:.3f}s.")

Here's some sample output:

Trying to match from token -1 against 'xxx\\'... finished in 0.000s.
Trying to match from token -2 against 'xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -3 against 'x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -4 against 'xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -5 against 'xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.001s.
Trying to match from token -6 against 'x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.003s.
Trying to match from token -7 against 'xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.014s.
Trying to match from token -8 against 'xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.030s.
Trying to match from token -9 against 'xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.117s.
Trying to match from token -10 against 'xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.488s.
Trying to match from token -11 against 'xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 1.898s.
Trying to match from token -12 against 'x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 7.597s.
^CTrying to match from token -13 against 'xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'...Traceback (most recent call last):
  File "/cnk/users/home/lukes/corp/argentina/rerror.py", line 9, in <module>
    re.fullmatch(r"(x+|\s+)*", " ".join(substr))
  File "/home/lukes/.local/mambaforge/envs/umrk/lib/python3.11/site-packages/regex/regex.py", line 261, in fullmatch
    return pat.fullmatch(string, pos, endpos, concurrent, partial, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

But maybe my regex is a known pathological case that I should simply avoid? The builtin re module also exhibits this kind of slowdown, although not as dramatic AFAICS (as in, the -12 variant ran in ~4.5s instead of ~7.5s).

Answer 1 · 2023-03-23T10:02:21.000Z

It's a regression in the newest release.
Output with 2022.10.31 installed:

Trying to match from token -1 against 'xxx\\'... finished in 0.000s.
Trying to match from token -2 against 'xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -3 against 'x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -4 against 'xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -5 against 'xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -6 against 'x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -7 against 'xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -8 against 'xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -9 against 'xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -10 against 'xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -11 against 'xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -12 against 'x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -13 against 'xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -14 against 'xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -15 against 'xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -16 against 'x xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -17 against 'x x xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.
Trying to match from token -18 against 'xxx x x xxxxx xxxx xxxxx x xxxxx xxx xxxx xxxxxx xxx x xxxxxxxxx xxxx x xxxxx xxx\\'... finished in 0.000s.

Answer 2 · 2023-03-23T17:02:49.000Z

It's due to a change to fix issue #494. I thought it better for it to be slower than wrong, and you can add a timeout if necessary. I don't know how easy it'll be to solve the problem; I'm still working on it.

Answer 3 · 2023-03-23T17:09:41.000Z

Sure, that makes sense :) I thought it might be related. I'm indeed using timeout as a workaround. Thank you for the details!

Answer 4 · 2023-03-23T17:27:28.000Z

A better solution is to use possessive repeats and atomic groups to explicitly reduce the chance of backtracking, e.g. r"(x+|\s+)*+" or r"(?>x+|\s+)*". The re module in recent versions of Python support them.

Failing that, use timeouts just in case.

Answer 5 · 2023-03-24T14:28:08.000Z

Thank you very much for those tips, those are extremely useful! Always happy to learn more about lesser known regex features :)