cs-au-dk/dk.brics.automaton

Unexpected poor performance with sub string matching in AutomatonMatcher

turf00 opened this issue · 5 comments

My expectation with Brics is that for the following scenario:

  • When matching a sub-string inside an input String with the AutomatonMatcher via a RunAutomaton
  • Regex starts with any character with repetition i.e. .*
  • Input string does not match at all

That the runtime of matching would be equivalent to the number of characters in the input string, i.e. each char would be interrogated once as part of the matching.

E.g.

  public boolean find(final String input) {
    // automaton is of type RunAutomaton built using tabelize=true
    final AutomatonMatcher matcher = automaton.newMatcher(input);
    return matcher.find();
  }

However, this does not appear to be the case at all. The implementation does some form of "backtracking" in the AutomatonMatcher#find() method. Is this a known problem or just expected with the ability to match and return the position of matching.

https://github.com/cs-au-dk/dk.brics.automaton/blob/master/src/dk/brics/automaton/AutomatonMatcher.java#L94

   public boolean find() {
...
    int l = getChars().length();
		while (begin < l) {
			int p = automaton.getInitialState();
                         // inner loop
                         // will "backtrack per char" if the matching eventually fails
			for (int i = begin; i < l; i++) {
				final int new_state = automaton.step(p, getChars().charAt(i));
				if (new_state == -1) {
				    break;
				} else if (automaton.isAccept(new_state)) {
				    // found a match from begin to (i+1)
				    match_start = begin;
				    match_end=(i+1);
				}
				p = new_state;
			}
			if (match_start != -1) {
				setMatch(match_start, match_end);
				return true;
			}
			begin += 1;
		}
		if (match_start != -1) {
			setMatch(match_start, match_end);
			return true;
		} else {
			setMatch(-2, -2);
			return false;
		}
    }

You see that if a match occurs at position i, then we will continue until the match fails and then restart from position i+1.

Adding my own version of this class, with extra counting and logging I can see the following. We initially noticed this due to "poor" performance in relation to testing some inputs and patterns.

In essence we simply count how often automaton.step is checked.

For a test with:

  • Input string length = 100,000 chars
  • Input will not match
  • Pattern = .*(rhino[£$%])+

Then we see, calls to automaton.step = 5000050000

Which are way more than my expectation of 100k would be?

I was able to approach the performance I would expect by hacking a version of the RunAutomaton.run method that returns early if it reaches an accept state and giving any regex a wildcard match at the start. Obviously that doesn't handle position checking.

Please don't get me wrong though this library is a great tool and we appreciate all the work that went into it.

You might argue that the example is a little dumb as it has .* at the start and we are doing a match but I believe the same problem would also occur with other regexs that use alternations with repetition not at the start of the regex depending on the input string form.

I'm sure I am missing something and will welcome any assistance.

I have tried contacting the author of that code (see the Changelog)... Sorry I don't have time to investigate the issue, but fixes are welcome.

Ok I have a version that works with my limited tests. I would be concerned that it doesn't cover every case or handles start, end differently as I am far from an expert but I will tidy it up and create a PR.

My initial and now seeming naive version does not match correctly in all cases. It seems that the backtracking is required for some cases, its a shame there aren't tests otherwise this would have been made obvious.

I will attempt to continue to look into fixes, it seems according to the re2 docs that you can firstly identify a match using .*<Pattern> as an initial part of the pattern and then do a further match to locate the start, end.

@amoeller I provide a separate class for simply matching input strings without backtracking in this PR: #40

I thought I could associate it in that branch with this issue but that requires write permissions.

The class is MatchOnlyRunAutomaton

Closing this as it was addressed in #40