BurntSushi/aho-corasick

Enabling DFA breaks ASCII case insensitivity

alexkornitzer opened this issue · 4 comments

It appears that enabling dfa stops ascii_case_insensitive from working. Or are these two features intended to be mutually exclusive?

The code below fails with dfa set to true, but passes with it set to false.

extern crate aho_corasick;

use aho_corasick::AhoCorasickBuilder;

fn main() {
    let patterns = &["Samwise"];
    let haystack = "SAMWISE.abcd";

    let ac = AhoCorasickBuilder::new()
        .ascii_case_insensitive(true)
        .dfa(true)
        .build(patterns);
    let mat = ac.find(haystack).expect("should have a match");
    assert_eq!("SAMWISE", &haystack[mat.start()..mat.end()]);
}

Apologies in advance if this is in the docs and I am just being unobservant.

Oof. This was a nasty bug. Thank you for reporting it. There was an issue building the equivalence classes during NFA construction, and those are only used via DFA matching. This bug is fixed in aho-corasick 0.7.7. If you want to work around it without upgrading, then you could disable byte classes via byte_classes(false), but I'd recommend upgrading if you can. :-)
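To give a rough idea of why byte equivalence classes and case insensitivity can interact badly, here is a minimal std-only sketch. It is not the crate's actual code: the function name `byte_classes_sketch` and the boundary-marking scheme are assumptions made for illustration, and the comment above does not say exactly what went wrong. The idea is that bytes the automaton never needs to tell apart share a class. If the case variants of pattern bytes are not marked while the classes are built, a haystack byte like `A` ends up in the same class as `B`, and a class-based transition table can no longer match it against `a`.

```rust
/// Hypothetical sketch (not the aho-corasick implementation): map every byte
/// to an equivalence class so that bytes appearing in patterns are kept
/// apart from their neighbours.
fn byte_classes_sketch(patterns: &[&str], ascii_case_insensitive: bool) -> [u8; 256] {
    // boundary[b] == true means a class ends at byte b.
    let mut boundary = [false; 256];
    let mut mark = |b: u8| {
        if b > 0 {
            boundary[(b - 1) as usize] = true;
        }
        boundary[b as usize] = true;
    };
    for p in patterns {
        for &b in p.as_bytes() {
            mark(b);
            if ascii_case_insensitive {
                // Assumption for illustration: both case variants must be
                // distinguished, otherwise they collapse into a neighbour's class.
                mark(b.to_ascii_lowercase());
                mark(b.to_ascii_uppercase());
            }
        }
    }
    // Walk the byte range and start a new class after every boundary.
    let mut classes = [0u8; 256];
    let mut class = 0u8;
    for b in 0..256usize {
        classes[b] = class;
        if boundary[b] && b < 255 {
            class += 1;
        }
    }
    classes
}
```

Calling `byte_classes_sketch(&["Samwise"], false)` puts `A` and `B` in the same class, while passing `true` separates them. Disabling byte classes sidesteps the question entirely, since every byte then gets its own column in the transition table.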

Ah amazing, thanks for the rapid turnaround, super happy this is a bug and not intended behaviour (would have made my life hell :P). Will be updating straight away as DFA shaves off a considerable amount of time in my tagging engine!

I think that has fixed case_insensitive in RegexBuilder from the regex crate too, which I assume uses dfa by default?

@alexkornitzer The regex crate is a bit more complicated. It doesn't actually use the ascii_case_insensitive option from this crate. It should, but more analysis work is required to take advantage of it. (regex does Unicode case insensitivity by default, so using ascii_case_insensitive would only work when Unicode is disabled or if ASCII case insensitivity and Unicode case insensitivity would lead to the same result in some specific circumstances.)
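As a small illustration of where ASCII and Unicode case insensitivity diverge, here is a std-only sketch. The helper names `eq_ascii_ci` and `eq_unicode_ci` are made up for this example, and `to_lowercase` is only a rough stand-in for the Unicode case folding a regex engine performs, so treat it as an approximation rather than a description of what regex does internally.

```rust
/// ASCII-only comparison: only the bytes A-Z and a-z are folded.
fn eq_ascii_ci(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

/// Rough stand-in for a Unicode-aware comparison, using the standard
/// library's lowercase mappings (an approximation of case folding).
fn eq_unicode_ci(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}
```

For plain ASCII text such as `"SAMWISE"` and `"samwise"`, both helpers agree. For the Kelvin sign `"\u{212A}"` compared against `"k"`, the ASCII helper reports no match while the Unicode one does, which is the kind of case where swapping one mode for the other would change results.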