/supercomplex

A target-agnostic DFA-based lexer generator.

Primary LanguageC++MIT LicenseMIT

supercomplex

supercomplex is a simple C++ library that enables you to quickly write DFA-based lexer generators for any target language. This is useful for example, when writing a self-hosting compiler for a new language that does not yet have any standard lexer generators, such as flex or ANTLR.

It accepts the token description in the form of regular expressions (proper regular languages without extensions).

It uses Thompson's construction algorithm to generate a non-deterministic finite automaton based on the regular expression, then subset construction to transform said NFA to DFA. It can also minimize the number of states using Moore's algorithm.

The following syntax for regular expressions is supported (standard rules of precedence apply):

expression meaning
[0-9],[ATCG] character ranges and character sets
[^0-9ATCG] complements of character sets/ranges
expr1|expr2|... subexpression alternatives
(expr) subexpression grouping
(expr)* expr zero or multiple times
(expr)+ expr one or multiple times
(expr)? expr zero or one time

Note that . matches a literal period. Use [^] to match any character. The following characters must be escaped (with a \): [, ], (, ), +, -, *, |, \, ?. Otherwise, any character encountered is treated as literal (even '\0'). Regex can also be made to match empty string i.e. foo|bar| is valid (can be written as foo|bar|() for extra clarity). Empty expressions can be quantified, so ()+ is a valid expression (though useless), but quantifiers always require something on their left hand side. + and foo|* are therefore not valid expressions. Characters can be superfluously escaped - \4 is a valid expression equivalent to 4. Note however, that there are no standard escape sequences common in programming languages - \n is equivalent to n, not a newline.

supercomplex is templated so that it works with any standard C++ char type (char, wchar_t, char32_t, ...), so generating lexer accepting Unicode characters should not be a problem. Note that regular expressions must be of the same character type as the string being matched.

Since supercomplex is target-agnostic, you have to provide your own code generator for the target language. See examples/codegen_cpp_json.cpp to see how supercomplex can be used to generate a JSON lexer for C++ language target.