anknown/ahocorasick

Bug Report: The algorithm fails to match keywords accurately in marginal cases.

Aaronong opened this issue · 5 comments

I recently attempted to consume this module. But being the paranoid individual that I am, decided to run a robust test against the algorithm to check for its correctness.

My tests revealed that the algorithm matched keywords incorrectly in marginal cases.

I am willing to contribute the tests that I implemented for this algorithm, and also help figure out where/why it fails.

I am currently using the cloudflare/ahocorasick library because it is accurate 100% of the time, but I really do wish to see the bug in this library fixed because of its major edge in efficiency.

@Aaronong can you post the test case here? I would like to see how it fails

can you post your test case, @Aaronong i am also planning use it in my project.

I find a bad case. keywords=["美凯龙", "红星美凯龙", "徐州红星美凯龙","苏州相城红星美凯龙"], text="徐州红星美凯龙", MultiPatternSearch(,returnImmediately=false) function return "苏州相城红星美凯龙"

Sorry for dropping off the grid. I got busy and muted github.

I no longer have the test suite, but its pretty simple to implement. I'll provide the pseudocode below which should help you to debug the codebase.

  1. Load dictionary.txt into a set map[string]bool.
  2. Load dictionary.txt into the AC trie.
  3. For each line in text.txt:
    1. iterate through each (starting, ending) index pair:
      1. if the substring exists in the set, add the (word, starting_index) pair into your expected results.
    2. Run AC search on the line and check if matches your expected results exactly.

Hope this helps. Step 3 is O(n^2) with respect to each line, but since the lines are pretty short anyway, its pretty much a large constant. It was a couple seconds to run when I used it.

I find a bad case. keywords=["美凯龙", "红星美凯龙", "徐州红星美凯龙","苏州相城红星美凯龙"], text="徐州红星美凯龙", MultiPatternSearch(,returnImmediately=false) function return "苏州相城红星美凯龙"

I also met the same issue and created a PR to fix: #10