teddy

Teddy is a SIMD accelerated multiple substring matching algorithm, and this is an implementation of it in rust. The name and the core ideas in the algorithm were learned from the Hyperscan project, and this implementation was originally taken from the regex crate.

Installation and usage

Currently, Teddy only supports x86 and x86_64 processors with support for the SSSE3 SIMD instructions. Moreover, it requires a nightly rust compiler. To set up Teddy for your rust project, add

teddy = { version = "0.2", features = "simd-accel" }

to the [dependencies] section of your Cargo.toml file. Then you need to configure rustc to support the required SIMD instructions. Assuming that you are compiling only for your local machine, you can do this by compiling your project with the command

RUSTFLAGS="-C target-cpu=native" cargo build

If you want more control over the SIMD instruction set that is used, you can use instead

RUSTFLAGS="-C target-feature=+ssse3,+avx2" cargo build

(or some subset of those features -- currently, at least one of +ssse3 or +avx2 is required).

Here is a basic example of Teddy usage:

extern crate teddy;
use teddy::Teddy;

fn main() {
  let patterns = vec![b"cat", b"dog", b"fox"];
  let ted = Teddy::new(patterns.iter().map(|s| &s[..])).unwrap();
  println!("{:?}", ted.find(b"The quick brown fox jumped over the laxy dog."));
}

For more information, see the API documentation.

What if I don't have SSSE3 or a nightly compiler?

Then Teddy won't do much for you. For your convenience, however, Teddy will compile even if you don't have a nightly compiler or support for the right instruction set. In this case, Teddy will refuse at run-time to give you accelerated searches: specifically, Teddy::new will always return None. You can also check for acceleration support using Teddy::is_accelerated.

There is a reason that this crate supports being compiled without any of its useful parts. This way, your program or library can depend on Teddy unconditionally without sacrificing support for stable rust compilers. To get this working, add a dependency on teddy without the simd-accel feature:

[dependencies]
teddy = "0.2"

Then add a simd-accel feature to your project that just passes the simd-accel feature onto teddy:

[features]
simd-accel = ["teddy/simd-accel"]

Now, anyone who compiles your crate using cargo build will get a non-accelerated version. Anyone who builds your crate with cargo build --features=simd-accel will get an accelerated version, but they'd better be using a nightly compiler and have the right target features set up.

Acknowledgements

Although it might look like this crate is mainly my work, it isn't really. I started out by lifting Andrew Gallant's code straight from the regex crate. Although much of that code has been replaced now, the current implementation still owes a lot to the documentation below, which was produced by Andrew Gallant after he pored for who-knows-how-long over Hyperscan's undocumented code.

Background

The key idea of Teddy is to do packed substring matching. In the literature, packed substring matching is the idea of examining multiple bytes in a haystack at a time to detect matches. Implementations of, for example, memchr (which detects matches of a single byte) have been doing this for years. Only recently, with the introduction of various SIMD instructions, has this been extended to substring matching. The PCMPESTRI instruction (and its relatives), for example, implements substring matching in hardware. It is, however, limited to substrings of length 16 bytes or fewer, but this restriction is fine in a regex engine, since we rarely care about the performance difference between searching for a 16 byte literal and a 16 + N literal—16 is already long enough. The key downside of the PCMPESTRI instruction, on current (2016) CPUs at least, is its latency and throughput. As a result, it is often faster to do substring search with a Boyer-Moore variant and a well placed memchr to quickly skip through the haystack.

There are fewer results from the literature on packed substring matching, and even fewer for packed multiple substring matching. Ben-Kiki et al. [2] describes use of PCMPESTRI for substring matching, but is mostly theoretical and hand-waves performance. There is other theoretical work done by Bille [3] as well.

The rest of the work in the field, as far as I'm aware, is by Faro and Kulekci and is generally focused on multiple pattern search. Their first paper [4a] introduces the concept of a fingerprint, which is computed for every block of N bytes in every pattern. The haystack is then scanned N bytes at a time and a fingerprint is computed in the same way it was computed for blocks in the patterns. If the fingerprint corresponds to one that was found in a pattern, then a verification step follows to confirm that one of the substrings with the corresponding fingerprint actually matches at the current location. Various implementation tricks are employed to make sure the fingerprint lookup is fast; typically by truncating the fingerprint. (This may, of course, provoke more steps in the verification process, so a balance must be struck.)

The main downside of [4a] is that the minimum substring length is 32 bytes, presumably because of how the algorithm uses certain SIMD instructions. This essentially makes it useless for general purpose regex matching, where a small number of short patterns is far more likely.

Faro and Kulekci published another paper [4b] that is conceptually very similar to [4a]. The key difference is that it uses the CRC32 instruction (introduced as part of SSE 4.2) to compute fingerprint values. This also enables the algorithm to work effectively on substrings as short as 7 bytes with 4 byte windows. 7 bytes is unfortunately still too long. The window could be technically shrunk to 2 bytes, thereby reducing minimum length to 3, but the small window size ends up negating most performance benefits—and it's likely the common case in a general purpose regex engine.

Faro and Kulekci also published [4c] that appears to be intended as a replacement to using PCMPESTRI. In particular, it is specifically motivated by the high throughput/latency time of PCMPESTRI and therefore chooses other SIMD instructions that are faster. While this approach works for short substrings, I personally couldn't see a way to generalize it to multiple substring search.

Faro and Kulekci have another paper [4d] that I haven't been able to read because it is behind a paywall.

Teddy

Finally, we get to Teddy. If the above literature review is complete, then it appears that Teddy is a novel algorithm. More than that, in my experience, it completely blows away the competition for short substrings, which is exactly what we want in a general purpose regex engine. Again, the algorithm appears to be developed by the authors of Hyperscan. Hyperscan was open sourced late 2015, and no earlier history could be found. Therefore, tracking the exact provenance of the algorithm with respect to the published literature seems difficult.

DISCLAIMER: My understanding of Teddy is limited to reading auto-generated C code, its disassembly and observing its runtime behavior.

At a high level, Teddy works somewhat similarly to the fingerprint algorithms published by Faro and Kulekci, but Teddy does it in a way that scales a bit better. Namely:

Teddy's core algorithm scans the haystack in 16 or 32 byte chunks, depending on your processor. We will describe the algorithm here for 16 byte chunks, but the 32 byte case is similar.
Bitwise operations are performed on each chunk to discover if any region of it matches a set of precomputed fingerprints from the patterns. If there are matches, then a verification step is performed. In this implementation, our verificiation step is a naive. This can be improved upon.

The details to make this work are quite clever. First, we must choose how to pick our fingerprints. In Hyperscan's implementation, I believe they use the last N bytes of each substring, where N must be at least the minimum length of any substring in the set being searched. In this implementation, we use the first N bytes of each substring. (The tradeoffs between these choices aren't yet clear to me.) We then must figure out how to quickly test whether an occurrence of any fingerprint from the set of patterns appears in a 16 byte block from the haystack. To keep things simple, let's assume N = 1 and examine some examples to motivate the approach. Here are our patterns:

foo
bar
baz

The corresponding fingerprints, for N = 1, are f, b and b. Now let's set our 16 byte block to:

bat cat foo bump
xxxxxxxxxxxxxxxx

To cut to the chase, Teddy works by using bitsets. In particular, Teddy creates a mask that allows us to quickly compute membership of a fingerprint in a 16 byte block that also tells which pattern the fingerprint corresponds to. In this case, our fingerprint is a single byte, so an appropriate abstraction is a map from a single byte to a list of patterns that contain that fingerprint:

f |--> foo
b |--> bar, baz

Now, all we need to do is figure out how to represent this map in vector space and use normal SIMD operations to perform a lookup. The first simplification we can make is to represent our patterns as bit fields occupying a single byte. This is important, because a single SIMD vector can store 16 bytes.

f |--> 00000001
b |--> 00000010, 00000100

How do we perform lookup though? It turns out that SSSE3 introduced a very cool instruction called PSHUFB. The instruction takes two SIMD vectors, A and B, and returns a third vector C. All vectors are treated as 16 8-bit integers. C is formed by C[i] = A[B[i]]. (This is a bit of a simplification, but true for the purposes of this algorithm. For full details, see Intel's Intrinsics Guide.) This essentially lets us use the values in B to lookup values in A.

If we could somehow cause B to contain our 16 byte block from the haystack, and if A could contain our bitmasks, then we'd end up with something like this for A:

    0x00 0x01 ... 0x62      ... 0x66      ... 0xFF
A = 0    0        00000001      00000110      0

And if B contains our window from our haystack, we could use shuffle to take the values from B and use them to look up our bitsets in A. But of course, we can't do this because A in the above example contains 256 bytes, which is much larger than the size of a SIMD vector.

Nybbles to the rescue! A nybble is 4 bits. Instead of one mask to hold all of our bitsets, we can use two masks, where one mask corresponds to the lower four bits of our fingerprint and the other mask corresponds to the upper four bits. So our map now looks like:

'f' & 0xF = 0x6 |--> 00000001
'f' >> 4  = 0x6 |--> 00000111
'b' & 0xF = 0x2 |--> 00000110
'b' >> 4  = 0x6 |--> 00000111

Notice that the bitsets for each nybble correspond to the union of all fingerprints that contain that nibble. For example, both f and b have the same upper 4 bits but differ on the lower 4 bits. Putting this together, we have A0, A1 and B, where A0 is our mask for the lower nybble, A1 is our mask for the upper nybble and B is our 16 byte block from the haystack:

      0x00 0x01 0x02      0x03 ... 0x06      ... 0xF
A0 =  0    0    00000110  0        00000001      0
A1 =  0    0    0         0        00000111      0
B  =  b    a    t         _        t             p
B  =  0x62 0x61 0x74      0x20     0x74          0x70

But of course, we can't use B with PSHUFB yet, since its values are 8 bits, and we need indexes that are at most 4 bits (corresponding to one of 16 values). We can apply the same transformation to split B into lower and upper nybbles as we did A. As before, B0 corresponds to the lower nybbles and B1 corresponds to the upper nybbles:

     b   a   t   _   c   a   t   _   f   o   o   _   b   u   m   p
B0 = 0x2 0x1 0x4 0x0 0x3 0x1 0x4 0x0 0x6 0xF 0xF 0x0 0x2 0x5 0xD 0x0
B1 = 0x6 0x6 0x7 0x2 0x6 0x6 0x7 0x2 0x6 0x6 0x6 0x2 0x6 0x7 0x6 0x7

And now we have a nice correspondence. B0 can index A0 and B1 can index A1. Here's what we get when we apply C0 = PSHUFB(A0, B0):

     b         a        ... f         o         ... p
     A0[0x2]   A0[0x1]      A0[0x6]   A0[0xF]       A0[0x0]
C0 = 00000110  0            00000001  0             0

And C1 = PSHUFB(A1, B1):

     b         a        ... f         o        ... p
     A1[0x6]   A1[0x6]      A1[0x6]   A1[0x6]      A1[0x7]
C1 = 00000111  00000111     00000111  00000111     0

Notice how neither one of C0 or C1 is guaranteed to report fully correct results all on its own. For example, C1 claims that b is a fingerprint for the pattern foo (since A1[0x6] = 00000111), and that o is a fingerprint for all of our patterns. But if we combined C0 and C1 with an AND operation:

     b         a        ... f         o        ... p
C  = 00000110  0            00000001  0            0

Then we now have that C[i] contains a bitset corresponding to the matching fingerprints in a haystack's 16 byte block, where i is the ith byte in that block.

Once we have that, we can look for the position of the least significant bit in C. That position, modulo 8, gives us the pattern that the fingerprint matches. That position, integer divided by 8, also gives us the byte offset that the fingerprint occurs in inside the 16 byte haystack block. Using those two pieces of information, we can run a verification procedure that tries to match all substrings containing that fingerprint at that position in the haystack.

Implementation notes

The problem with the algorithm as described above is that it uses a single byte for a fingerprint. This will work well if the fingerprints are rare in the haystack (e.g., capital letters or special characters in normal English text), but if the fingerprints are common, you'll wind up spending too much time in the verification step, which effectively negate the performance benefits of scanning 16 bytes at a time. Remember, the key to the performance of this algorithm is to do as little work as possible per 16 bytes.

This algorithm can be extrapolated in a relatively straight-forward way to use larger fingerprints. That is, instead of a single byte prefix, we might use a three byte prefix. The implementation below implements N = {1, 2, 3} and always picks the largest N possible. The rationale is that the bigger the fingerprint, the fewer verification steps we'll do. Of course, if N is too large, then we'll end up doing too much on each step.

The way to extend it is:

Add a mask for each byte in the fingerprint. (Remember that each mask is composed of two SIMD vectors.) This results in a value of C for each byte in the fingerprint while searching.
When testing each 16 byte block, each value of C must be shifted so that they are aligned. Once aligned, they should all be AND'd together. This will give you only the bitsets corresponding to the full match of the fingerprint.

The implementation below is commented to fill in the nitty gritty details.

References

[1] Hyperscan on GitHub, webpage
[2a] Ben-Kiki, O., Bille, P., Breslauer, D., Gasieniec, L., Grossi, R., & Weimann, O. (2011). Optimal packed string matching. In LIPIcs-Leibniz International Proceedings in Informatics (Vol. 13). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. DOI: 10.4230/LIPIcs.FSTTCS.2011.423. PDF.
[2b] Ben-Kiki, O., Bille, P., Breslauer, D., Ga̧sieniec, L., Grossi, R., & Weimann, O. (2014). Towards optimal packed string matching. Theoretical Computer Science, 525, 111-129. DOI: 10.1016/j.tcs.2013.06.013. PDF.
[3] Bille, P. (2011). Fast searching in packed strings. Journal of Discrete Algorithms, 9(1), 49-56. DOI: 10.1016/j.jda.2010.09.003. PDF.
[4a] Faro, S., & Külekci, M. O. (2012, October). Fast multiple string matching using streaming SIMD extensions technology. In String Processing and Information Retrieval (pp. 217-228). Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-34109-0_23. PDF.
[4b] Faro, S., & Külekci, M. O. (2013, September). Towards a Very Fast Multiple String Matching Algorithm for Short Patterns. In Stringology (pp. 78-91). PDF.
[4c] Faro, S., & Külekci, M. O. (2013, January). Fast packed string matching for short patterns. In Proceedings of the Meeting on Algorithm Engineering & Expermiments (pp. 113-121). Society for Industrial and Applied Mathematics. PDF.
[4d] Faro, S., & Külekci, M. O. (2014). Fast and flexible packed string matching. Journal of Discrete Algorithms, 28, 61-72. DOI: 10.1016/j.jda.2014.07.003.

stevefan1999-personal/teddy