BurntSushi/aho-corasick

Question: Matches surrounded by separator characters

kiminuo opened this issue · 3 comments

Hi,

I use aho-corasick algorithm to search for keywords in a text. This example is basically what I use https://github.com/BurntSushi/aho-corasick#example-basic-searching Yet, there is one caveat. I need to find out whether a match (i.e. a found keyword) provided by this algo is surrounded by separator characters (e.g. " ", "?", "!", ".", "*", ",", ";", ":", "\n", "\r", "(", ")", "„", "“", "<", ">").

e.g.

let patterns = &["apple"];
let haystack = "I love Snapple.";

Snapple word contains apple, so it is a match but it is not in form [separator-character]apple[separator-character] so I want to skip this match in my program.

aho-corasick algo returns matches as byte positions - i.e. (keyword_index, start, end) and my text is in UTF8 with non-ascii characters.

I can think of several solutions but I don't find any of them elegant.

Does anyone know a proper-rust-solution for this?
Thank you!

Why did you close this issue? If you found a solution, please share it. :)

My knowledge of Rust is rather limited so the solution may not be great. Anyway, with help of Rust community, I've come up with:

const DELIMITERS: [char; 17] = [' ', '?', '!', '.', '*', ',', ';', ':', '\n', '\r', '(', ')', '„', '“', '<', '>', '\u{c2a0}' /* non-breaking space */ ];

// The idea is to get str slice up to first matching character (i.e. `start`) and take last character. They say that the last character is not O(n) as one may expect but rather O(1) as there is some specialized code in Rust, so it should be fast.
let to_replace = DELIMITERS.contains(&(&text[0..*start]).chars().last().unwrap());

(the code is simplified to test only character before match)

Yes, indeed, that LGTM. And yes, str::Chars::last is a O(1) operation.

Also, if you're searching for a small number of strings, you could also just use a regex. Although Aho-Corasick will probably be faster. Hard to say.

I'm happy to help with other text search questions you might have. In the future though, it would be much appreciate if you could post your full program. It's hard to help without that.

I'll close this now since it seems like you have an answer!