Enabling DFA breaks ASCII case insensitivity
alexkornitzer opened this issue · 4 comments
It appears that enabling dfa will stop ascii_case_insensitive from working, or are these two features intended to be mutually exclusive?
The below code will fail with dfa set to true, but will pass with it set to false.
    extern crate aho_corasick;

    use aho_corasick::AhoCorasickBuilder;

    fn main() {
        let patterns = &["Samwise"];
        let haystack = "SAMWISE.abcd";

        let ac = AhoCorasickBuilder::new()
            .ascii_case_insensitive(true)
            .dfa(true)
            .build(patterns);
        let mat = ac.find(haystack).expect("should have a match");
        assert_eq!("SAMWISE", &haystack[mat.start()..mat.end()]);
    }
Apologies in advance if this is in the docs and I am just being unobservant.
Oof. This was a nasty bug. Thank you for reporting it. There was an issue building the equivalence classes during NFA construction, and those are only used via DFA matching. This bug is fixed in aho-corasick 0.7.7. If you want to work around it without upgrading, then you could disable byte classes via byte_classes(false), but I'd recommend upgrading if you can. :-)
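To make the "equivalence classes" remark concrete, here is a simplified, std-only sketch of byte equivalence classes and of one plausible way case insensitivity can go wrong with them. It is an illustration under stated assumptions, not the crate's actual code: the function name `byte_classes` and the `mark_both_cases` flag are hypothetical, and the real construction is more involved.

```rust
/// Hypothetical helper: map each of the 256 byte values to a class id.
/// Bytes that no pattern mentions share class 0; every mentioned byte gets
/// its own class. A DFA indexed by class id cannot tell apart two bytes
/// that share a class.
///
/// When `mark_both_cases` is true, the opposite ASCII case of every letter
/// in a pattern is marked as well, which is what ASCII case-insensitive
/// matching needs.
fn byte_classes(patterns: &[&str], mark_both_cases: bool) -> [usize; 256] {
    let mut distinct = [false; 256];
    for pat in patterns {
        for &b in pat.as_bytes() {
            distinct[b as usize] = true;
            if mark_both_cases && b.is_ascii_alphabetic() {
                // Flipping bit 0x20 switches an ASCII letter's case.
                distinct[(b ^ 0x20) as usize] = true;
            }
        }
    }

    // Assign class ids in byte order; unmarked bytes stay in class 0.
    let mut classes = [0usize; 256];
    let mut next = 1;
    for (b, &is_distinct) in distinct.iter().enumerate() {
        if is_distinct {
            classes[b] = next;
            next += 1;
        }
    }
    classes
}

fn main() {
    let patterns = ["Samwise"];

    // Only the bytes spelled out in the pattern are marked, so 'A' (which
    // occurs in the haystack "SAMWISE") lands in the same class as '.'.
    let without = byte_classes(&patterns, false);
    println!(
        "one case marked:   'A' -> {}, '.' -> {}",
        without[b'A' as usize],
        without[b'.' as usize]
    );

    // With both cases marked, 'A' is distinguishable from unrelated bytes.
    let with = byte_classes(&patterns, true);
    println!(
        "both cases marked: 'A' -> {}, '.' -> {}",
        with[b'A' as usize],
        with[b'.' as usize]
    );
}
```

In this sketch, a matcher that looks up transitions by class id would treat 'A' exactly like a byte that appears in no pattern unless both cases are marked. That is consistent with the symptom in the report, and it also suggests why turning byte classes off avoids the problem: transitions are then looked up by raw byte value.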
Ah amazing, thanks for the rapid turnaround, super happy this is a bug and not intended behaviour (would have made my life hell :P). Will be updating straight away as DFA shaves off a considerable amount of time in my tagging engine!
I think that has fixed case_insensitive in RegexBuilder from the regex crate too, which I assume uses dfa by default?
@alexkornitzer The regex crate is a bit more complicated. It doesn't actually use the ascii_case_insensitive option from this crate. It should, but more analysis work is required to take advantage of it. (regex does Unicode case insensitivity by default, so using ascii_case_insensitive would only work when Unicode is disabled, or in the specific circumstances where ASCII case insensitivity and Unicode case insensitivity lead to the same result.)
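The difference between the two kinds of case insensitivity can be seen with the standard library alone. The sketch below is only an approximation: the helper names `ascii_ci_eq` and `unicode_ci_eq` are hypothetical, and `str::to_lowercase` stands in for real Unicode case folding, which the regex crate implements with its own tables.

```rust
/// Hypothetical helper: ASCII-only case-insensitive comparison.
/// Only the bytes A-Z and a-z are folded; everything else must match exactly.
fn ascii_ci_eq(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

/// Hypothetical helper: a rough stand-in for Unicode case-insensitive
/// comparison, done by lowercasing both sides with the standard library's
/// Unicode-aware `to_lowercase`.
fn unicode_ci_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    // For plain ASCII text the two notions agree.
    assert!(ascii_ci_eq("Samwise", "SAMWISE"));
    assert!(unicode_ci_eq("Samwise", "SAMWISE"));

    // U+212A KELVIN SIGN lowercases to an ordinary 'k' under Unicode rules,
    // but it is not an ASCII letter, so the ASCII comparison rejects it.
    let haystack = "\u{212A}elvin";
    assert!(!ascii_ci_eq(haystack, "kelvin"));
    assert!(unicode_ci_eq(haystack, "kelvin"));

    println!("ASCII and Unicode case-insensitive comparisons can disagree");
}
```

Because a pattern containing a letter such as k can match non-ASCII text under Unicode rules, an ASCII-only matcher could miss matches unless the pattern is first shown to behave identically under both rules.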