Bug Report: The algorithm fails to match keywords accurately in marginal cases.

Question

Bug Report: The algorithm fails to match keywords accurately in marginal cases.

Aaronong opened this issue 6 years ago · 5 comments

I recently attempted to consume this module. But being the paranoid individual that I am, decided to run a robust test against the algorithm to check for its correctness.

My tests revealed that the algorithm matched keywords incorrectly in marginal cases.

I am willing to contribute the tests that I implemented for this algorithm, and also help figure out where/why it fails.

I am currently using the cloudflare/ahocorasick library because it is accurate 100% of the time, but I really do wish to see the bug in this library fixed because of its major edge in efficiency.

Answer 1 · 2018-07-31T23:35:19.000Z

@Aaronong can you post the test case here? I would like to see how it fails

Answer 2 · 2018-12-04T10:01:47.000Z

can you post your test case, @Aaronong i am also planning use it in my project.

Answer 3 · 2019-03-15T03:02:22.000Z

I find a bad case. keywords=["美凯龙", "红星美凯龙", "徐州红星美凯龙","苏州相城红星美凯龙"], text="徐州红星美凯龙", MultiPatternSearch(,returnImmediately=false) function return "苏州相城红星美凯龙"

Answer 4 · 2019-03-15T18:51:32.000Z

Sorry for dropping off the grid. I got busy and muted github.

I no longer have the test suite, but its pretty simple to implement. I'll provide the pseudocode below which should help you to debug the codebase.

Load dictionary.txt into a set map[string]bool.
Load dictionary.txt into the AC trie.
For each line in text.txt:
1. iterate through each (starting, ending) index pair:
  1. if the substring exists in the set, add the (word, starting_index) pair into your expected results.
2. Run AC search on the line and check if matches your expected results exactly.

Hope this helps. Step 3 is O(n^2) with respect to each line, but since the lines are pretty short anyway, its pretty much a large constant. It was a couple seconds to run when I used it.

Answer 5 · 2019-09-04T02:57:25.000Z

I find a bad case. keywords=["美凯龙", "红星美凯龙", "徐州红星美凯龙","苏州相城红星美凯龙"], text="徐州红星美凯龙", MultiPatternSearch(,returnImmediately=false) function return "苏州相城红星美凯龙"

I also met the same issue and created a PR to fix: #10