regex

Motivation

Why another regex implementation?
While anyone is welcome to use this library, I doubt that it will be of any practical use. Most programming languages come with a native regex implementation (e.g. std::regex in c++). Furthermore, many standalone implementations exist with more functionality and better performance than this library. I created this library first and foremost as a learning exercise.

About

This library differentiates itself from other modern regular expression engines by being entirely implemented as a Deterministic Finite Automata (for better or worse). Engines driven by an underlying backtracking NFA tend to be faster because their construction is relatively cheap. However, this does come at a cost. Backtracking can lead to Catastrophic Backtracking. Furthermore, NFAs are, in principle, less deterministic in their behavior than DFAs. A DFA-based implementation has a high upfront construction time, but can match inputs in linear time.

Installation and CMake integration

Please check here for installation and CMake integration instructions.

Usage

Find online Doxygen documentation here.

Example code snippet:

#include <iostream>
#include <regex/Regex.hpp>
#include <string>

int main()
{
    auto regex = regex::Regex("col[ou]r");
    std::cout << std::boolalpha << regex.match("color") << std::endl;
    return 0;
}

Example code snippet with non-ASCII character code points:

#include <iostream>
#include <regex/Regex.hpp>
#include <string>

int main()
{
    auto regex = regex::Regex("[AÅÃ]");
    std::cout << std::boolalpha << regex.match("A") << std::endl;
    std::cout << std::boolalpha << regex.match("Å") << std::endl;
    std::cout << std::boolalpha << regex.match("Ã") << std::endl;
    return 0;
}

Exceptions are thrown on illegal usage (e.g. usage of unsupported regex features):

#include <iostream>
#include <regex/Regex.hpp>
#include <string>

int main()
{
    /* Anchors are not supported*/
    auto regex = regex::Regex("^Hello World!$");
    return 0;
}

throws exception:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Error at position 1. Message: Anchors are not supported

Take a look at the unit tests for more examples.

Supported Regex Features

Regular Expression Engine Comparison Chart of all mainstream regular expression engines

Characters
Feature	Support	Note(s)
Backslash escapes one metacharacter	YES
\Q...\E escapes a string of metacharacters	NO
\x00 through \xFF (ASCII character)	NO	See \u0000-\uFFFF \u00000000-\u0010FFFF
\n (LF)	YES
\r (CR)	YES
\t (tab)	YES
\f (form feed)	YES
\v (vtab)	YES
\a (bell)	YES
\e (escape)	NO
\b (backspace)	NO
\B (backslash)	NO
\\ (backslash)	YES
\cA through \cZ (control character)	NO
\ca through \cz (control character)	NO
Character Classes or Character Sets
Feature	Support	Note(s)
[abc] character class	YES
[^abc] negated character class	YES
Hyphen in [\d-z] is a literal	YES
Hyphen in [a-\d] is a literal	YES
Hyphen in [\w-\d] is a literal	YES
Backslash escapes one character class metacharacter	YES
\Q...\E escapes a string of character class metacharacters	NO
\d shorthand for digits	YES	Equivalent to [0-9]
\D (negation of \d)	YES	Equivalent to [^0-9]
\w shorthand for word characters	YES	Equivalent to [a-zA-Z0-9_]
\W (negation of \w)	YES	Equivalent to [^a-zA-Z0-9_]
\s shorthand for whitespace	YES	Equivalent to [ \t\r\n\f]
\S (negation of \s)	YES	Equivalent to [^ \t\r\n\f]
[\b] backspace	NO
Dot
Feature	Support	Note(s)
. (dot)	YES	Excludes new-line (\n)
Anchors
Feature	Support	Note(s)
^ (start of string/line)	NO
$ (end of string/line)	NO
\A (start of string)	NO
\Z (end of string, before final line break)	NO
\z (end of string)	NO
\` (start of string)	NO
\' (end of string)	NO
Word Boundaries
Feature	Support	Note(s)
\b (at the beginning or end of a word)	NO
\B (NOT at the beginning or end of a word)	NO
\y (at the beginning or end of a word)	NO
\Y (NOT at the beginning or end of a word)	NO
\m (at the beginning of a word)	NO
\M (at the end of a word)	NO
\< (at the beginning of a word)	NO
\> (at the end of a word)	NO
Alternation
Feature	Support	Note(s)
\| (alternation)	YES
Quantifiers
? (0 or 1)	YES
* (0 or more)	YES
+ (1 or more)	YES
{n} (exactly n)	YES
{n,m} (between n and m)	YES
{n,} (n or more)	YES
? after any of the above quantifiers to make it "lazy"	NO
Grouping and Backreferences
Feature	Support	Note(s)
(regex) (numbered capturing group)	NO	Used for non-capturing groups instead (see below)
(regex) (non-capturing group)	YES	This library uses this syntax for non-capturing groups In contrast, most other implementations use it for capturing groups!
(?:regex) (non-capturing group)	NO
\1 through \9 (backreferences)	NO
\10 through \99 (backreferences)	NO
Forward references \1 through \9	NO
Nested references \1 through \9	NO
Backreferences non-existent groups are an error	NO
Backreferences to failed groups also fail	NO
Modifiers
Feature	Support	Note(s)
(?i) (case insensitive)	NO
(?s) (dot matches newlines)	NO
(?m) (^ and $ match at line breaks)	NO
(?x) (free-spacing mode)	NO
(?n) (explicit capture)	NO
(?-ismxn) (turn off mode modifiers)	NO
(?ismxn:group) (mode modifiers local to group)	NO
Atomic Grouping and Possessive Quantifiers
Feature	Support	Note(s)
(?>regex) (atomic group)	NO
?+, *+, ++ and {m,n}+ (possessive quantifiers)	NO
Lookaround
(?=regex) (positive lookahead)	NO
(?!regex) (negative lookahead)	NO
(?<=text) (positive lookbehind)	NO
(?	NO
Continuing from The Previous Match
Feature	Support	Note(s)
\G (start of match attempt)	NO
Conditionals
Feature	Support	Note(s)
(?(?=regex)then\|else) (using any lookaround)	NO
(?(regex)then\|else)	NO
(?(1)then\|else)	NO
(?(group)then\|else)	NO
Comments
Feature	Support	Note(s)
(?#comment)	NO
Free-Spacing Syntax
Feature	Support	Note(s)
Free-spacing syntax supported	NO
Character class is a single token	NO
# starts a comment	NO
Unicode Characters
Feature	Support	Note(s)
\X (Unicode grapheme)	NO
\u0000 through \uFFFF (4 digit Unicode character)	YES
\U00000000 through \u0010FFFF (8 digit Unicode character)	YES
\x{0} through \x{FFFF} (Unicode character)	NO
Unicode Properties, Scripts and Blocks
\pL through \pC (Unicode properties)	NO
\p{L} through \p{C} (Unicode properties)	NO
\p{Lu} through \p{Cn} (Unicode property)	NO
\p{L&} and \p{Letter&} (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode properties)	NO
\p{IsL} through \p{IsC} (Unicode properties)	NO
\p{IsLu} through \p{IsCn} (Unicode property)	NO
\p{Letter} through \p{Other} (Unicode properties)	NO
\p{Lowercase_Letter} through \p{Not_Assigned} (Unicode property)	NO
\p{IsLetter} through \p{IsOther} (Unicode properties)	NO
\p{IsLowercase_Letter} through \p{IsNot_Assigned} (Unicode property)	NO
\p{Arabic} through \p{Yi} (Unicode script)	NO
\p{IsArabic} through \p{IsYi} (Unicode script)	NO
\p{BasicLatin} through \p{Specials} (Unicode block)	NO
\p{InBasicLatin} through \p{InSpecials} (Unicode block)	NO
\p{IsBasicLatin} through \p{IsSpecials} (Unicode block)	NO
Part between {} in all of the above is case insensitive	NO
Spaces, hyphens and underscores allowed in all long names listed above (e.g. BasicLatin can be written as Basic-Latin or Basic_Latin or Basic Latin)	NO
\P (negated variants of all \p as listed above)	NO
\p{^...} (negated variants of all \p{...} as listed above)	NO
Named Capture and Backreferences
Feature	Support	Note(s)
(?regex) (.NET-style named capturing group)	NO
(?'name'regex) (.NET-style named capturing group)	NO
\k (.NET-style named backreference)	NO
\k'name' (.NET-style named backreference)	NO
(?Pregex) (Python-style named capturing group	NO
(?P=name) (Python-style named backreference)	NO
multiple capturing groups can have the same name	NO
XML Character Classes
Feature	Support	Note(s)
\i, \I, \c and \C shorthand XML name character classes	NO
[abc-[abc]] character class subtraction	NO
POSIX Bracket Expressions
Feature	Support	Note(s)
[:alpha:] POSIX character class	NO
\p{Alpha} POSIX character class	NO
\p{IsAlpha} POSIX character class	NO
[.span-ll.] POSIX collation sequence	NO
[=x=] POSIX character equivalence	NO

Quality

This repository employs the following practices to achieve a reasonable level of quality:

Modern CMake to build, test, and install the library.
Continuous integration that:
- builds on GNU g++
- builds on clang++
- runs unit tests
- runs static code analysis (clang-tidy)
- runs dynamic code analysis (Valgrind)
- analyzes unit test code coverage
- builds with many compiler warnings enabled