Authors: Aurèle Barrière and Clément Pit-Claudel.
This is a linear regular expression engine for a subset of JavaScript regexes. The underlying algorithm is an extension of the PikeVM, supporting more JavaScript features. This engine implements the algorithms described in the paper Linear Matching of JavaScript Regular Expressions by the same authors.
In particular, it supports, for the first time with linear time and space complexity:
- nullable JavaScript quantifiers (these have different semantics than in other regex languages, see for instance
(a?b??)*
on string "ab") - capture reset, a JavaScript-specific property where capture groups are reset at each quantifier iteration (for instance
((a)|(b))*
on string "ab") - all lookarounds (lookahads and lookbehinds), even with capture groups inside
- linear matching of the greedy or nullable plus.
RegElk means Regex Engine with Linear looKarounds. Elks are diagonal walkers, meaning that they reuse their front legs prints for their rear legs to conserve energy, evoking how a PikeVM merges threads reaching the same state to preserve linearity.
Given a regex of size |r|
and a string of size |s|
, this engine has linear worst-case time complexity in both of them O(|r|*|s|)
.
While counted quantifiers are supported, they increase the regex size.
For instance, e{4-8}
will multiply the size of e
8 times.
However, the greedy plus (+
or {1,}
) or the nonnullable lazy plus (as in (ab)+?
) are handled without duplication.
The engine also has O(|r|*|s|)
space complexity.
If one wants to avoid a string-size dependent space complexity, we provide alternative register data-structures, presenting various time-space complexity tradeoff.
Time Complexity | Space Complexity | |
---|---|---|
List (default) | O(|r|*|s|) |
O(|r|*|s|) |
Array | O(|r|^2*|s|) |
O(|r|^2) |
Tree | O(|r|*log(|r|)*|s|) |
O(|r|^2) |
Note however that a O(|r|*|s|)
space complexity cannot be avoided when using our linear lookaround algorithm.
Feature | Example |
---|---|
Lookaheads | a(?=(b)) , a(?!=b) |
Lookbehinds | (?<=b)a , (?<!b)a |
Capture Groups | (a*)b |
Noncapturing Groups | (?:a*)b |
Greedy Quantifiers | * , + , ? |
Lazy Quantifiers | *? , +? , ?? |
Counted Quantifiers | a{6,12} , a{7,} , a{9} , a{4,5}? |
Character Classes | [a-z] , [^h] , [aeiouy] |
Character Groups | \w , \d , \s , \W , \D , \S |
Anchors | $ , ^ |
Word Boundaries | \b , \B |
Backreferences are not supported, as they make the matching problem NP-hard. Named capture groups, hexadecimal escapes, unicode escapes, unicode properties and regex flags are not supported yet, although they could be in the future.
You need the following Opam packages. Other version numbers may also work.
- Ocaml 5.0
- ocamlbuild 0.14.1
- Menhir 20220210
- ocaml_intrinsics v0.15.2
- core_bench v0.15.0
- core v0.15.1
- core_unix v0.15.2
- yojson 2.1.0
You also need to install Node.JS and have node
in your path.
Build all executables with make
.
Make sure to have configured your opam switch so that it has all dependencies listed above.
This creates several executables:
main.native
is the Ocaml matcherfuzzer.native
is a fuzzer that compares the OCaml matcher to the Irregexp engine of V8 in Nodetests.native
contains a battery of tests that should all succeedstats.native
computes regex feature usage statistics from corpora of regexesbenchmark.native
allows you to run benchmarksmatcher.native
andlinearbaseline.native
are only used for the benchmarks and you should not run them directly
main.native
, fuzzer.native
and benchmark.native
have command line options that can be printed with the argument --help
.
- the main entry point of the engine is in the
main.ml
file - regexes and regex annotation are found in
regex.ml
- the extended bytecode NFA representation is defined in
bytecode.ml
- compilation from a regex to bytecode is found in
compiler.ml
- the NFA simulation algorithm, with all our extensions, is implemented in
interpreter.ml
- the CDN plus formulas are defined in
cdn.ml
- the oracles used y the lookaround algorithm is defined in
oracle.ml
- the three capture registers implementations are defined in
regs.ml
- character classes are implemented in
charclasses.ml
- the ECMA-style regex parser is defined in
parser_src/
- the differential fuzzer is implemented in
fuzzer.ml
- the computation of statistics on regex features usage is defined in
stats.ml
- the
scripts_bench
directory contains scripts called by the benchmarks or the fuzzer to compare the OCaml engine to other engines.
- The linear engine from V8 is called "V8Linear" in the paper. It is sometimes called "Experimental" or "Exp" in the code (as this is the name used by the V8 developers).
- The bytecode instructions
Consume
andConsumeAny
from Figure 4 are replaced by a singleConsume
instruction inbytecode.ml
, which takes as argument either a character or a list of character ranges. - The
Jump
instruction is calledJmp
in the code. SetReg
is calledSetRegisterToCP
.SetQuant
is calledSetQuantToClock
.CheckNull
is calledCheckNullable
.SetNullPlus
in the paper is not an independent instruction in the code. InsteadSetQuantToClock
encodes bothSetQuant
andSetNullPlus
: it takes a boolean argument indicating if this quantifier is a nulled plus or not.- The "Balanced Tree" register implementation in the paper is renamed to
Map_Regs
in the code.
This is implemented by the functions advance_epsilon
and find_match
in interpreter.ml
.
- In
compiler.ml
, line 112, you can see the bytecode compilation of a quantifier and see thatBeginLoop
andEndLoop
instructions are inserted. - In
interpreter.ml
, the thread booleanexit_allowed
encodes in which automata the thread is. See at lines 412 and 417 how the two new instructions are implemented.
- Threads are augmented with clocks in
quant_regs
(line 107 ofinterpreter.ml
). - The filtering algorithm is implemented at line 285 of
interpreter.ml
.
- The oracle table is defined in
oracle.ml
. - The first phase is implemented as the
build_oracle
function, line 651 ofinterpreter.ml
. - The second phase is simply the
find_match
function defined previously. - The third phase is implemented as the
build_capture
function, line 680 ofinterpreter.ml
.
Switch to the strlb
directory for this algorithm, and see the corresponding README.md
file.
- The nullability analysis of Section 4.5.1 and Figure 12 starts at line 139 in
regex.ml
. - The non-nullable plus case of Section 4.5.2 is implemented at line 78 in
compiler.ml
. - For Section 4.5.3, see lines 90 and 101 of
compiler.ml
. - The nullability formulas of footnote 8 are defined in
cdn.ml
and called "CDN formulas".
- The three different register data-structures are defined in
regs.ml
:Array_Regs
,List_Regs
andMap_Regs
. - At line 28 of
interpreter.ml
, see that the interpreter is parameterized by a register implementation (aREGS
module). - The benchmark used for Figure 15 is defined as
dsarray
,dslist
anddstree
inbenchmark_vectors.ml
.
- All regex corpora are in the
corpus
directory. - The
stats.ml
file uses the parser to analyze each regex and see in which categories it belongs to.
- (C1) the benchmark used for Figure 17 is named "Clocks", defined line 117 of
benchmark_vectors
. - (C2) the benchmark used for Figure 18 is named "NNPlus", defined line 69.
- (C3) the benchmark used for Figure 19 is named "CDN", defined line 95.
- (C4) the benchmark used for Figure 20 is named "LBstr", defined line 179.
- (C5) the benchmark used for Figure 21 is named "LAreg", defined line 138.
- (C5) the benchmark used for Figure 22 is named "LAstr", defined line 157.
- For every pair of regex and string discussed in the paper, we added this test to our test suite.
- This list is the
paper_tests
list at line 290 oftests.ml
. - This test suite is executed for each of the three register implementations when running
tests.native
.