In what circumstances is the SIMD code used?
itamarst opened this issue · 3 comments
Hi,
Thanks for the library! Already see some speedups in my prototype benchmarks.
Was curious in what circumstances SIMD will be used? Basically I am trying to optimize purely for speed of running the regex, compile time and memory usage don't matter, so anything I can do to choose fast paths would help.
To expand: pretty sure the code was compiled with SIMD (e.g. if I do target=native I see AVX2 in the assembly). The question is which code paths will hit the fast path.
Ah, found DESIGN. I guess that basically answers the question (and suggests checking for 5000 keywords won't use SIMD).
Yes, 5,000 keywords will definitely blow the SIMD budget currently. The place in the code that determines whether to use the Teddy algorithm (the only explicitly SIMD algorithm in this crate, although memchr is used in some circumstances too) is here:
aho-corasick/src/packed/teddy/compile.rs
Lines 75 to 111 in a416b0c
So for Teddy, the maximum number of patterns is 64. The problem is that if you have too many, it overwhelms the algorithm and tends to generate too many false positive candidates. Then checking the candidates ends up dominating search time.
There are other SIMD algorithms that might scale better. Teddy was taken from Hyperscan, which also has an FDR algorithm that should work for larger numbers of literals. But I haven't had a chance to port that yet.
Otherwise, see also the AhoCorasickBuilder::auto_configure
method, which also tweaks some performance options. You'll definitely want to enable the use of a DFA for 5,000 keywords, assuming that compilation time doesn't matter too much.
Hope that helps. Happy to answer other questions!