anknown/ahocorasick

Matching whole words in the middle of a longer string

ibrierley opened this issue · 3 comments

Hi, I have seen the issue at #4

But ExactSearch just seems to compare a single word against a single word. It doesn't match a whole word in the middle of a longer string at a word boundary, which is what the original problem report asked for.

I.e. with ExactSearch, "abc" will NOT match at all in "abcde abc zabc", but will match if the string is exactly "abc" (so it's basically acting like a Map).
With MultiPatternSearch, on the other hand, "abc" will match 3 times in that string.

It would be good to have an option where it can match inside an arbitrarily long string, but only at word boundaries on either side (e.g. if there is whitespace or the start/end of the line next to the match). I'd be happy to add a specific boundary character between words if it helps.

Hope that makes sense!

Just to give an idea of a hacky test that gets me closer, in the middle of MultiPatternSearch if I do...

// func (m *Machine) MultiPatternSearch(content []rune, returnImmediately bool) [](*Term) {
// ...start of func
// .. for _, word := range val {
// ...then add this inside the loop
// accept the match only if the char before the word is below code point 34 (control chars, space, '!') or the word starts the string,
// and the char after the word is below code point 34 or the word ends the string
        if ( pos-len(word) < 0 || content[pos-len(word)] < 34 ) && ( pos+1 == contentLength || content[pos+1] < 34 ) {

            term := new(Term)
            term.Pos = pos - len(word) + 1
            term.Word = word
            terms = append(terms, term)
            if returnImmediately {
                return terms
            }
        }

It naturally won't work for languages that go beyond simple ASCII, and it would need a switch in the func to decide whether to use it or not, but it's the sort of thing I meant.
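For the non-ASCII concern, one possible direction is to do the boundary check as a separate pass over the reported matches, using `unicode.IsSpace` instead of a raw code point comparison. The following is only a minimal sketch under assumptions: `Match`, `atBoundary` and `filterWholeWords` are hypothetical names and are not part of the library, and "boundary" is taken to mean Unicode whitespace or the start/end of the content.

```go
package main

import (
	"fmt"
	"unicode"
)

// Match is a hypothetical stand-in for a search result: the rune index
// where the match starts and its length in runes.
type Match struct {
	Pos int
	Len int
}

// atBoundary reports whether index i lies outside content or holds a
// Unicode whitespace rune.
func atBoundary(content []rune, i int) bool {
	return i < 0 || i >= len(content) || unicode.IsSpace(content[i])
}

// filterWholeWords keeps only the matches that have a boundary
// immediately before their first rune and immediately after their last.
func filterWholeWords(content []rune, matches []Match) []Match {
	var out []Match
	for _, m := range matches {
		if atBoundary(content, m.Pos-1) && atBoundary(content, m.Pos+m.Len) {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	content := []rune("abcde abc zabc")
	// Candidate matches for "abc", as an unfiltered multi-pattern
	// search would report them (hard-coded here for illustration).
	candidates := []Match{{Pos: 0, Len: 3}, {Pos: 6, Len: 3}, {Pos: 11, Len: 3}}
	fmt.Println(filterWholeWords(content, candidates))
}
```

With the example string from above, only the candidate starting at rune index 6 survives the filter. Doing the check inside the search loop, as in the hack, would avoid building the unfiltered list first; the separate pass just keeps the sketch independent of the library internals.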

@ibrierley https://github.com/petar-dambovaliev/aho-corasick/tree/master
I implemented it, if this is what you were referring to.

Thanks for this! I've just added a comment/issue on your repo about a problem I'm having getting it going.