This code is provided for educational purposes. It is not efficient. But it illustrates the idea.
Brzozowski derivative is a less known technique to work with regular expressions. Normally, to match a string using regex we have to construct an automaton and simulate it. With the regex derivative technique we can use a regular expression 'directly' without the automaton construction and simulation.
It has nothing to do with classical derivative in analysis. But the symbolic nature and chain rule application make it feel similar. For more details there is a paper "Regular-expression derivative re-examined" by Scott Owens, John Reppy and Aaron Turon.
This code implements only three operators. There are concatenation (
Usage:
> python match.py <regex> <string>
E.g.
> python match.py '(c|b)at' 'cat'
(c|b)·a·t
c: a·t
a: t
t: ε
True
> python match.py '(c|b)at' 'car'
(c|b)·a·t
c: a·t
a: t
r: ∅
False
The derivative of a language
For any characters a and b and for any regular expressions r and s we have following rules:
where the function
We need two rules with respect to strings to complete the rule set:
To find a match we have to check if the derivative of the regex
Let's check if word
Let's take a derivative with respect of each character separately:
So
There are several ways to implement the derivatives. Some implementations follow a pure functional style. Here instead I use good old imperative style to avoid hidden logic.
The match.py
invokes the following functions to find out if a string matches against a regex:
-
augment
: extends the input regex with an explicit concatenation operator ($\cdot$ ):$c*at \to c * \cdot a \cdot t$ . -
infix_to_postfix
: converts the augmented regex to a postfix expression. -
postfix_to_tree
: converts the postfix expression to a binary tree with tokens in leafs and operators in non-terminal nodes (update: current implementation uses plain python tuples to keep things simple). -
match
: invokesderiv
for each token of the input string. Then evaluates the regex withnullable
.-
deriv
: takes the derivative of the regex (now represented as a binary tree) with respect to a token of an input string. We modify the binary tree inplace. For some operators we have to clone a branch of the tree with the trivial recursive functionclone
. -
_sort
: sort the alternations in the tree using merge sort algorithm. -
_norm
: simplifies expressions like$r | \varepsilon$ or$\emptyset \cdot r$ . -
nullable
: checks if the resulting regex (binary tree) is nullable. If it is nullable then we found a match.
-
If we want to match multiple strings against a regular expression it is too expensive to compute the gradients for each string. We can do better. Let's pre-compute the derivatives for any possible input string.
The classical way is to construct non-deterministic finite state automaton (NFA) and then convert it to DFA. With Brzozowski derivatives we can construct DFA directly.
Given the regex
Q <- r queue to keep unexplored states
D <- r hash table to store all found states
while Q is not empty
r <- Q.pop()
for any c in alphabet A:
s = take derivative of r wrt c
if s not in dictionary D, then
Q.push(s)
The logic is implemented in the construct_dfa.py
script. It includes code to construct graphviz compatible graph specification.
> python construct_dfa.py '(c|m)at'
digraph G {
splines=true; rankdir=LR;
"(c|m)·a·t";
"(c|m)·a·t" -> "∅" [ label="a,t" ];
"(c|m)·a·t" -> "a·t" [ label="c,m" ];
"∅";
"∅" -> "∅" [ label="a,c,m,t" ];
"a·t";
"a·t" -> "t" [ label="a" ];
"a·t" -> "∅" [ label="c,m,t" ];
"t";
"t" -> "∅" [ label="a,c,m" ];
"t" -> "ε" [ label="t" ];
"ε" [peripheries=2];
"ε" -> "∅" [ label="a,c,m,t" ];
}
The way to render it using graphviz command line library:
> python construct_dfa.py '(c|m)at' | dot -Tpng > dfa.png
- It is fun
- Using this technique we can construct a very efficient automaton close to what is called minimal DFA. There are some considerations though but in some cases such construction can be very handy.
- Extend the derivative theory section
Add the DFA construction code- Explain regex normalization techniques