/regex

A c++ regex library based on DFA

Primary LanguageC++MIT LicenseMIT

regex

Coverage Status CI

Motivation

Why another regex implementation?
While anyone is welcome to use this library, I doubt that it will be of any practical use. Most programming languages come with a native regex implementation (e.g. std::regex in c++). Furthermore, many standalone implementations exist with more functionality and better performance than this library. I created this library first and foremost as a learning exercise.

About

This library differentiates itself from other modern regular expression engines by being entirely implemented as a Deterministic Finite Automata (for better or worse). Engines driven by an underlying backtracking NFA tend to be faster because their construction is relatively cheap. However, this does come at a cost. Backtracking can lead to Catastrophic Backtracking. Furthermore, NFAs are, in principle, less deterministic in their behavior than DFAs. A DFA-based implementation has a high upfront construction time, but can match inputs in linear time.

Installation and CMake integration

Please check here for installation and CMake integration instructions.

Usage

Find online Doxygen documentation here.

Example code snippet:

#include <iostream>
#include <regex/Regex.hpp>
#include <string>

int main()
{
    auto regex = regex::Regex("col[ou]r");
    std::cout << std::boolalpha << regex.match("color") << std::endl;
    return 0;
}

Example code snippet with non-ASCII character code points:

#include <iostream>
#include <regex/Regex.hpp>
#include <string>

int main()
{
    auto regex = regex::Regex("[AÅÃ]");
    std::cout << std::boolalpha << regex.match("A") << std::endl;
    std::cout << std::boolalpha << regex.match("Å") << std::endl;
    std::cout << std::boolalpha << regex.match("Ã") << std::endl;
    return 0;
}

Exceptions are thrown on illegal usage (e.g. usage of unsupported regex features):

#include <iostream>
#include <regex/Regex.hpp>
#include <string>

int main()
{
    /* Anchors are not supported*/
    auto regex = regex::Regex("^Hello World!$");
    return 0;
}

throws exception:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Error at position 1. Message: Anchors are not supported 

Take a look at the unit tests for more examples.

Supported Regex Features

Regular Expression Engine Comparison Chart of all mainstream regular expression engines

Characters
Feature Support Note(s)
Backslash escapes one metacharacter YES
\Q...\E escapes a string of metacharacters NO
\x00 through \xFF (ASCII character) NO See
\u0000-\uFFFF
\u00000000-\u0010FFFF
\n (LF) YES
\r (CR) YES
\t (tab) YES
\f (form feed) YES
\v (vtab) YES
\a (bell) YES
\e (escape) NO
\b (backspace) NO
\B (backslash) NO
\\ (backslash) YES
\cA through \cZ (control character) NO
\ca through \cz (control character) NO
Character Classes or Character Sets
Feature Support Note(s)
[abc] character class YES
[^abc] negated character class YES
Hyphen in [\d-z] is a literal YES
Hyphen in [a-\d] is a literal YES
Hyphen in [\w-\d] is a literal YES
Backslash escapes one character class metacharacter YES
\Q...\E escapes a string of character class metacharacters NO
\d shorthand for digits YES Equivalent to [0-9]
\D (negation of \d) YES Equivalent to [^0-9]
\w shorthand for word characters YES Equivalent to [a-zA-Z0-9_]
\W (negation of \w) YES Equivalent to [^a-zA-Z0-9_]
\s shorthand for whitespace YES Equivalent to [ \t\r\n\f]
\S (negation of \s) YES Equivalent to [^ \t\r\n\f]
[\b] backspace NO
Dot
Feature Support Note(s)
. (dot) YES Excludes new-line (\n)
Anchors
Feature Support Note(s)
^ (start of string/line) NO
$ (end of string/line) NO
\A (start of string) NO
\Z (end of string, before final line break) NO
\z (end of string) NO
\` (start of string) NO
\' (end of string) NO
Word Boundaries
Feature Support Note(s)
\b (at the beginning or end of a word) NO
\B (NOT at the beginning or end of a word) NO
\y (at the beginning or end of a word) NO
\Y (NOT at the beginning or end of a word) NO
\m (at the beginning of a word) NO
\M (at the end of a word) NO
\< (at the beginning of a word) NO
\> (at the end of a word) NO
Alternation
Feature Support Note(s)
| (alternation) YES
Quantifiers
? (0 or 1) YES
* (0 or more) YES
+ (1 or more) YES
{n} (exactly n) YES
{n,m} (between n and m) YES
{n,} (n or more) YES
? after any of the above quantifiers to make it "lazy" NO
Grouping and Backreferences
Feature Support Note(s)
(regex) (numbered capturing group) NO Used for non-capturing groups instead (see below)
(regex) (non-capturing group) YES This library uses this syntax for non-capturing groups
In contrast, most other implementations use it for capturing groups!
(?:regex) (non-capturing group) NO
\1 through \9 (backreferences) NO
\10 through \99 (backreferences) NO
Forward references \1 through \9 NO
Nested references \1 through \9 NO
Backreferences non-existent groups are an error NO
Backreferences to failed groups also fail NO
Modifiers
Feature Support Note(s)
(?i) (case insensitive) NO
(?s) (dot matches newlines) NO
(?m) (^ and $ match at line breaks) NO
(?x) (free-spacing mode) NO
(?n) (explicit capture) NO
(?-ismxn) (turn off mode modifiers) NO
(?ismxn:group) (mode modifiers local to group) NO
Atomic Grouping and Possessive Quantifiers
Feature Support Note(s)
(?>regex) (atomic group) NO
?+, *+, ++ and {m,n}+ (possessive quantifiers) NO
Lookaround
(?=regex) (positive lookahead) NO
(?!regex) (negative lookahead) NO
(?<=text) (positive lookbehind) NO
(? NO
Continuing from The Previous Match
Feature Support Note(s)
\G (start of match attempt) NO
Conditionals
Feature Support Note(s)
(?(?=regex)then|else) (using any lookaround) NO
(?(regex)then|else) NO
(?(1)then|else) NO
(?(group)then|else) NO
Comments
Feature Support Note(s)
(?#comment) NO
Free-Spacing Syntax
Feature Support Note(s)
Free-spacing syntax supported NO
Character class is a single token NO
# starts a comment NO
Unicode Characters
Feature Support Note(s)
\X (Unicode grapheme) NO
\u0000 through \uFFFF (4 digit Unicode character) YES
\U00000000 through \u0010FFFF (8 digit Unicode character) YES
\x{0} through \x{FFFF} (Unicode character) NO
Unicode Properties, Scripts and Blocks
\pL through \pC (Unicode properties) NO
\p{L} through \p{C} (Unicode properties) NO
\p{Lu} through \p{Cn} (Unicode property) NO
\p{L&} and \p{Letter&} (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode properties) NO
\p{IsL} through \p{IsC} (Unicode properties) NO
\p{IsLu} through \p{IsCn} (Unicode property) NO
\p{Letter} through \p{Other} (Unicode properties) NO
\p{Lowercase_Letter} through \p{Not_Assigned} (Unicode property) NO
\p{IsLetter} through \p{IsOther} (Unicode properties) NO
\p{IsLowercase_Letter} through \p{IsNot_Assigned} (Unicode property) NO
\p{Arabic} through \p{Yi} (Unicode script) NO
\p{IsArabic} through \p{IsYi} (Unicode script) NO
\p{BasicLatin} through \p{Specials} (Unicode block) NO
\p{InBasicLatin} through \p{InSpecials} (Unicode block) NO
\p{IsBasicLatin} through \p{IsSpecials} (Unicode block) NO
Part between {} in all of the above is case insensitive NO
Spaces, hyphens and underscores allowed in all long names listed above
(e.g. BasicLatin can be written as Basic-Latin or Basic_Latin or Basic Latin)
NO
\P (negated variants of all \p as listed above) NO
\p{^...} (negated variants of all \p{...} as listed above) NO
Named Capture and Backreferences
Feature Support Note(s)
(?regex) (.NET-style named capturing group) NO
(?'name'regex) (.NET-style named capturing group) NO
\k (.NET-style named backreference) NO
\k'name' (.NET-style named backreference) NO
(?Pregex) (Python-style named capturing group NO
(?P=name) (Python-style named backreference) NO
multiple capturing groups can have the same name NO
XML Character Classes
Feature Support Note(s)
\i, \I, \c and \C shorthand XML name character classes NO
[abc-[abc]] character class subtraction NO
POSIX Bracket Expressions
Feature Support Note(s)
[:alpha:] POSIX character class NO
\p{Alpha} POSIX character class NO
\p{IsAlpha} POSIX character class NO
[.span-ll.] POSIX collation sequence NO
[=x=] POSIX character equivalence NO

Quality

This repository employs the following practices to achieve a reasonable level of quality:

  • Modern CMake to build, test, and install the library.
  • Continuous integration that:
    • builds on GNU g++
    • builds on clang++
    • runs unit tests
    • runs static code analysis (clang-tidy)
    • runs dynamic code analysis (Valgrind)
    • analyzes unit test code coverage
    • builds with many compiler warnings enabled