Bug Report: The algorithm fails to match keywords accurately in marginal cases.
Aaronong opened this issue · 5 comments
I recently attempted to consume this module. But being the paranoid individual that I am, decided to run a robust test against the algorithm to check for its correctness.
My tests revealed that the algorithm matched keywords incorrectly in marginal cases.
I am willing to contribute the tests that I implemented for this algorithm, and also help figure out where/why it fails.
I am currently using the cloudflare/ahocorasick library because it is accurate 100% of the time, but I really do wish to see the bug in this library fixed because of its major edge in efficiency.
can you post your test case, @Aaronong i am also planning use it in my project.
I find a bad case. keywords=["美凯龙", "红星美凯龙", "徐州红星美凯龙","苏州相城红星美凯龙"], text="徐州红星美凯龙", MultiPatternSearch(,returnImmediately=false) function return "苏州相城红星美凯龙"
Sorry for dropping off the grid. I got busy and muted github.
I no longer have the test suite, but its pretty simple to implement. I'll provide the pseudocode below which should help you to debug the codebase.
- Load
dictionary.txt
into a setmap[string]bool
. - Load
dictionary.txt
into the AC trie. - For each line in
text.txt
:- iterate through each (starting, ending) index pair:
- if the substring exists in the set, add the (word, starting_index) pair into your expected results.
- Run AC search on the line and check if matches your expected results exactly.
- iterate through each (starting, ending) index pair:
Hope this helps. Step 3 is O(n^2) with respect to each line, but since the lines are pretty short anyway, its pretty much a large constant. It was a couple seconds to run when I used it.