oh no why
regress is a backtracking regular expression engine implemented in Rust, which targets JavaScript regular expression syntax. See the crate documentation for more.
It's fast, Unicode-aware, has few dependencies, and has a big test suite. It makes fewer guarantees than the regex crate, but it enables more syntactic features, such as backreferences and lookaround assertions.
Add this to your Cargo.toml:
[dependencies]
regress = "0.1"
The tester binary can be used for some fun.
You can see how things get compiled with the dump-phases crate feature:
> cargo run --features dump-phases --bin tester 'x{3,4}' 'i'
You can run a little benchmark too, for example:
> cargo run --release --bin tester 'abcd' 'i' --file ~/3200.txt
This was my first Rust program so no doubt there is room for improvement.
Lots of stuff is still missing; maybe you want to contribute?
- Named capture groups like (?<count>\d+)
- Named character classes like [[:alpha:]]
- Unicode property escapes like \p{Sc}
- An API for replacing a string while substituting in capture groups (e.g. with $1)
- An API for escaping a string to make it a literal
- Implementing std::str::pattern::Pattern
- The tester binary needs some real usage.
- Anchored matches like ^abc still perform a string search. We should compute whether the whole regex is anchored, and optimize matching if so.
- Non-greedy loops like .*? will eagerly compute their maximum match. This doesn't affect correctness, but it does mean they may do more matching work than they should.
- Case-insensitive literals should compute the "preimage" (i.e. the set of characters which fold together) instead of folding. In particular, if the preimage is only that character, this will accelerate matching.
- Pure literal searches should use Boyer-Moore or a similar algorithm.
- The fold table should be bitpacked more tightly, e.g. using 24 bits for a code point.
- There are lots of vectorization opportunities.
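
To give a feel for the pure literal search item above, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that keeps only the bad-character shift table. This is not code from regress: the function name `horspool_find` is hypothetical, it works on raw bytes, and it is case-sensitive, so it ignores the folding concerns mentioned in the list.

```rust
/// Hypothetical helper, not part of regress: find the first occurrence of
/// `needle` in `haystack` using the Boyer-Moore-Horspool bad-character rule.
/// Returns the byte offset of the match, if any.
fn horspool_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let n = needle.len();
    if n == 0 {
        // An empty needle matches at the start.
        return Some(0);
    }
    if n > haystack.len() {
        return None;
    }

    // For each byte value, how far the window may slide when that byte sits
    // under the needle's last position. Bytes that do not occur in the needle
    // (ignoring its final byte) allow a full-length slide.
    let mut shift = [n; 256];
    for (i, &b) in needle[..n - 1].iter().enumerate() {
        shift[b as usize] = n - 1 - i;
    }

    let mut pos = 0;
    while pos + n <= haystack.len() {
        if &haystack[pos..pos + n] == needle {
            return Some(pos);
        }
        // Slide based on the haystack byte aligned with the needle's end.
        let last = haystack[pos + n - 1];
        pos += shift[last as usize];
    }
    None
}
```

For example, `horspool_find(b"xabcd", b"abcd")` returns `Some(1)`. The attraction is that a mismatch can skip up to a whole needle length at once, rather than advancing one byte at a time.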