BurntSushi/aho-corasick

matching only whole words

petar-dambovaliev opened this issue · 8 comments

Does it make sense to add only matching whole words as an option?

No. Use a regex.

No. Use a regex.

How would you efficiently use a regex where the patterns are 100k+ and you have 100k+ haystacks, where the haystack size is up to 1mb?

I am not sure if i was clear enough.

 patterns = ["bea"]
 haystack = "bear and masha"

Currently, with this input, there would be a match "bea" with starting position 0.
What i need is to not match patterns where the pattern ends before a word boundry in the haystack.

How would you efficiently use a regex where the patterns are 100k+ and you have 100k+ haystacks, where the haystack size is up to 1mb?

You could try regex-automata to build the regex offline. I'm not sure how practical it would be though. 100,000 is a lot. Maybe Hyperscan would be better suited for this task.

Currently, with this input, there would be a match "bea" with starting position 0.
What i need is to not match patterns where the pattern ends before a word boundry in the haystack.

Yes, I understood the request. It will not be added to this library. This library is about literal searching. It's not about incrementally adding little regex features. If you want regex features, then use a regex. If that's not good enough, then you'll need to build something bespoke or find a way to adapt an existing solution to your needs.

@petar-dambovaliev One reason why this sort of desire comes up is because you're searching with a dictionary of small words and large words. In the default non-overlapping formulation of Aho-Corasick, the small words will match inside of the bigger words, thus preventing all matches of bigger words that contain smaller words. You have a couple of choices for how to deal with this:

  • Use leftmost-longest match semantics. This is a feature unique to this Aho-Corasick implementation.
  • Use overlapping matches along with a buffer of matches. Then fix up the matches after the fact, giving priority to the matches you want.

If this doesn't match the reason why you want word boundaries, then it probably won't be applicable. But it's impossible for me to know because you haven't told me what the actual problem you're trying to solve is. Instead, you came to me with a particular solution. So now I'm forced to guess what it is you're actually trying to do.

@petar-dambovaliev One reason why this sort of desire comes up is because you're searching with a dictionary of small words and large words. In the default non-overlapping formulation of Aho-Corasick, the small words will match inside of the bigger words, thus preventing all matches of bigger words that contain smaller words. You have a couple of choices for how to deal with this:

  • Use leftmost-longest match semantics. This is a feature unique to this Aho-Corasick implementation.
  • Use overlapping matches along with a buffer of matches. Then fix up the matches after the fact, giving priority to the matches you want.

If this doesn't match the reason why you want word boundaries, then it probably won't be applicable. But it's impossible for me to know because you haven't told me what the actual problem you're trying to solve is. Instead, you came to me with a particular solution. So now I'm forced to guess what it is you're actually trying to do.

Basically, i have a collection of terms (patterns) that consist of one or more words and texts (haystacks) in which i need to find their exact matches. The reason is, i am replacing those with a special hypertext around the matches inside the original text.

Example: If you have terms as ["mary", "little lamb"]
mary had a little lamb would become <mytag>mary</mytag> had a <mytag>little lamb</mytag>

I am already using the leftmost-longest match semantics and that solves another set of problems for my use case.
However, it's still matching terms like i described above.
The simplest solution for me is to simply make sure that the next character is space or end of the string.
Yes, it's not elegant but it probably add negligible overhead.

The simplest solution for me is to simply make sure that the next character is space or end of the string.
Yes, it's not elegant but it probably add negligible overhead.

I agree. You could make it a little better by checking it with the regex \w|$.

Thanks for describing your problem. It would definitely help to just start with that next time. :-)

The simplest solution for me is to simply make sure that the next character is space or end of the string.
Yes, it's not elegant but it probably add negligible overhead.

I agree. You could make it a little better by checking it with the regex \w|$.

Thanks for describing your problem. It would definitely help to just start with that next time. :-)

Thanks for the awesome library though.
I realize that my use case might be too specific and not being worth adding to the library, specially if it's going outside of the scope.

Yeah. Specifically, adding functionality like this would basically require a rewrite of the entire library to the point where it really couldn't be called "aho corasick" any more. The Aho-Corasick algorithm is highly specialized for literal patterns specifically.