Final project for 6.945, implementing Thompson NFAs for pattern matching.
Thanks to Russ Cox and his helpful article at http://swtch.com/~rsc/regexp/regexp1.html.
In this project, we recreate a pattern matching engine using Thompson nondeterministic finite automata (NFAs). The project originally began as an extension of the match combinator system from problem set 3; however, after implementing a few new features, including backreferencing and a multiplicity operator, we realized that the system was robust enough to make extension easy, so extending it did not make for a very interesting project. We therefore analyzed the problem set's pattern matcher looking for ways to improve it, and found that it fails to perform well when large amounts of backtracking are required to find a match. One example we looked at is the regular expression a?^30a^30, i.e. the character 'a' optionally matched 30 times in a row followed by 30 required 'a's. This performs badly on the current system because a? by default chooses to eat the 'a', and if the match later fails, backtracking changes the a? to not eat it. So, to successfully match an input of a^30, the program first tries to eat all the 'a's with a?^30; then only the last a? declines an 'a'; then two a?'s decline, and so forth, until eventually none of the a?'s eat any 'a's and all the single-character matchers are satisfied. This requires traversing an exponentially growing tree: with only 22 a?'s and a's, the match took over 3 minutes to complete.
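To make the blowup concrete, here is a minimal sketch (ours, not the problem set 3 code) of a backtracking matcher specialized to patterns built from the symbols a? and a; all the helper names here are hypothetical:

    ;; Patterns are lists of the symbols a? and a; input is a list of chars.
    ;; a? greedily eats an #\a first, and backtracks to skipping it on failure.
    (define (naive-match pattern chars)
      (cond ((null? pattern) (null? chars))
            ((eq? (car pattern) 'a?)
             (or (and (pair? chars)
                      (eqv? (car chars) #\a)
                      (naive-match (cdr pattern) (cdr chars)))
                 (naive-match (cdr pattern) chars)))
            (else                        ; a literal a
             (and (pair? chars)
                  (eqv? (car chars) #\a)
                  (naive-match (cdr pattern) (cdr chars))))))

    ;; a?^n a^n run against a^n: the only successful assignment is for every
    ;; a? to skip, and the greedy search explores roughly 2^n paths first.
    (define (pathological n)
      (append (make-list n 'a?) (make-list n 'a)))

    (naive-match (pathological 22) (make-list 22 #\a))  ; => #t, but slowly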
So, we searched for solutions to this problem and found Thompson NFAs. In this approach, a state machine is built to represent the regular expression, the data is fed into the state machine, and the input matches if the final 'matching' state is reached. The difference from a backtracking matcher is that the need to backtrack is removed: at a fork, the machine steps to every potential state instead of choosing one. This paradigm eliminates the exponential behavior seen in the example above because the NFA contains exactly one state per data character in the regular expression, so the number of states in the generated NFA is no more than the length of the regular expression which created it. Since the set of active states never contains duplicates, each character of input costs at most work proportional to the pattern length, and matching takes time proportional to the pattern length times the input length rather than exponential time.
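For instance, a plausible rendering of the NFA for the pattern a?a under this scheme (our illustration, not output from the system) is:

                  blank
             +-----------+
             |           v
        (start) --a--> (1) --a--> (end)

Feeding it the single character a leaves probes at node 1 (via the character edge) and at end (via the blank edge and then the character edge), so both a and aa match.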
The general approach is to use generic operators to create an NFA from a regular expression, similar to how the system in problem set 3 created match combinators which were applied to data. We represent an NFA as a network of nodes and edges. Nodes are stored in a hash table with special 'start' and 'end' nodes; the other nodes are denoted by integer keys. The value stored in the hash table for a node is the list of edges out of that node. An edge is represented as a tagged list containing a predicate, which checks whether the edge should be traversed, and a destination node, which names the next node given that the predicate returns true on the input data.
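A concrete sketch of this representation in MIT Scheme follows; the constructor names here are ours and may differ from what matcher.scm actually defines:

    ;; Nodes are hash-table keys: the symbols start and end, or integers.
    ;; The value for a node is the list of its outbound edges.
    (define (make-nfa) (make-equal-hash-table))

    ;; An edge is a tagged list of a predicate and a destination node.
    (define (make-edge predicate destination)
      (list 'edge predicate destination))
    (define (edge-predicate edge) (cadr edge))
    (define (edge-destination edge) (caddr edge))

    (define (add-edge! nfa from edge)
      (hash-table-set! nfa from
                       (cons edge (hash-table-ref/default nfa from '()))))

    ;; The default single-character component: an edge whose predicate
    ;; tests the next character of data.
    (define (char-edge char destination)
      (make-edge (lambda (c) (eqv? c char)) destination))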
As data is fed into the network, each 'probe' sits at a node and advances to every node for which the predicate of an outbound edge is satisfied by the next character of data. The set of probes currently in the network thus represents all the ways in which the NFA is still attempting to match the data. If none of a probe's outbound edge predicates are satisfied, the probe is removed from the NFA. If any probe reaches the 'end' node, the system reports a successful match; if all the data is eaten without a probe reaching 'end', the NFA could not match the data, and false is returned.
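Under the representation sketched above, the stepping loop might look as follows (again, our names; follow-blanks, which handles blank edges, is sketched after the next paragraph):

    ;; Advance every probe across each outbound edge whose predicate
    ;; accepts the character c.  Blank edges store the symbol blank in
    ;; place of a predicate, so the procedure? test skips them here.
    (define (step-probes nfa probes c)
      (append-map
       (lambda (node)
         (filter-map
          (lambda (edge)
            (let ((pred (edge-predicate edge)))
              (and (procedure? pred) (pred c) (edge-destination edge))))
          (hash-table-ref/default nfa node '())))
       probes))

    (define (run-nfa nfa chars)
      (let loop ((probes (follow-blanks nfa (list 'start)))
                 (chars chars))
        (cond ((memq 'end probes) #t)    ; some probe reached end: match
              ((null? chars) #f)         ; data exhausted without a match
              (else
               (loop (follow-blanks nfa (step-probes nfa probes (car chars)))
                     (cdr chars))))))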
This solution uses a nondeterministic finite automaton, rather than a deterministic one, to provide the ability to traverse multiple edges at once. One complication that came with spawning multiple probes out of a single node is that many structures have what we call a blank edge: an edge that should be traversed regardless of the data being fed into the NFA, and that should not eat any data. This is a departure from the typical paradigm in which the whole network eats data on each step and all probes advance accordingly. To deal with this, when a probe reaches a destination it must step through all blank edges before moving on. That way no additional data is eaten by the network, but the probes end up in the positions we expect.
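The blank-edge traversal can be sketched as a closure computation: starting from the current probes, repeatedly follow blank edges until no new nodes appear (our helper names again, continuing the sketch above):

    (define (blank-edge? edge)
      (eq? (edge-predicate edge) 'blank))

    ;; Return the probes together with every node reachable from them
    ;; through blank edges alone, without eating any data.
    (define (follow-blanks nfa probes)
      (let loop ((pending probes) (seen '()))
        (cond ((null? pending) seen)
              ((member (car pending) seen)
               (loop (cdr pending) seen))
              (else
               (let ((node (car pending)))
                 (loop (append (filter-map
                                (lambda (edge)
                                  (and (blank-edge? edge)
                                       (edge-destination edge)))
                                (hash-table-ref/default nfa node '()))
                               (cdr pending))
                       (cons node seen)))))))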
To implement this system, a subset of the regular expression language was chosen: single-character match, concatenation, multiplicities (?, * and +), and the 'or' operator. The bulk of the new matching system lives in matcher.scm, while the individual components are created and tested in their own files (with the exception of the single-character matcher, which also lives in matcher.scm since it is the default for the generic operator).
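For example, the ? multiplicity falls out of the representation naturally: build the sub-expression's fragment between two nodes, then add a blank edge that bypasses it. A hedged sketch, where compile-pattern! stands in for whatever generic constructor matcher.scm actually uses:

    (define (compile-optional! nfa subpattern from to)
      (compile-pattern! nfa subpattern from to)      ; match the body once...
      (add-edge! nfa from (make-edge 'blank to)))    ; ...or skip it entirely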
After making the NFA the pattern matching system, we ran the same experiment as before, and it found the match in under a second. We then increased the pattern to a?^100a^100, and this took about a second to match. We tested extensively across the implemented subset of the regular expression language, and the speed improvement is dramatic. Furthermore, this system is extensible: adding a new component is as simple as understanding how it would be represented in an NFA and using defhandler to delegate the work appropriately.
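Extending the system then looks roughly like this (a sketch with hypothetical predicate names, following the course convention that defhandler takes one predicate per argument of the generic operator):

    (defhandler compile-pattern!
      compile-optional!
      nfa? optional-pattern? node? node?)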
There are several limitations to what our system can do. First, it is limited to the subset of components we have chosen: single characters, concatenation, multiplicity (zero or one, zero or more, one or more), and choice. Second, nested lists may work in certain cases but are not guaranteed to. Finally, NFAs can express only a subset of common regex features; for example, an NFA cannot implement backreferences, so we lose the ability to bind variables to matched segments as in problem set 3.
There are many options for future work to improve the system we've created. First, an obvious extension is allowing a lower and/or upper bound on the multiplicity of an expression. Many other standard regular expression patterns can be expressed with Thompson NFAs, and we would like to add as many of those as possible. Another helpful extension would be to report more about the state of the system when it fails or succeeds, such as why it failed or whether any data was left unconsumed.