WojciechMula/pyahocorasick

new 'iter_long' match misses after the recent fix

vegarden opened this issue · 7 comments

After fix #174 which is discussed in #133.
I found the code below

source_text = 'zzabcabdzz'
A = ahocorasick.Automaton()
for k in ['ab', 'abcabd']:
    A.add_word(k, k)

A.make_automaton()
 
list(A.iter_long(source_text))

matches [(6, 'ab')], but it matches [(7, 'abcabd')] in version 1.4.4.
The behavior before seems to be right in this situation.

This happens for me as well. 1.4.4 worked fine, 2.0.0 misses the longest match!

This is indeed a bug introduced by a previous fix for iter_long missing legitimate short matches.
I attempt a fix #186 and add more test cases, but cross my fingers for not introducing more bugs. Can someone help review the pr? @pombredanne
Meanwhile it would be nice if anyone can test my branch and report success/bugs.

The idea is before you fail over to the suffix node, perform a check to see whether there is possibly longer match ahead the current node. If there is a longer match, we should give up the suffix failover.

ehhh, it does not fully work, since it's not enough to just look ahead one step. You need to look ahead all possible longer matches before deciding whether to fail over. Efficiency-wise that's not acceptable.
Give me some time to think it over, I hope it does not require a overhaul to fully resolve this bug.

#187 Come up with another fix, hopefully it works this time. Feedback welcomed.
Note that in case of two partially overlapping strings it can't return the longest match, but the first match.
@pombredanne mind to review?

Anyone tested this fix?