Why another regex implementation?
Anyone is welcome to use this library, but I doubt that it will be of much practical use. Most programming languages ship with a native regex implementation (e.g. std::regex in C++), and many standalone implementations offer more functionality and better performance than this library. I created this library first and foremost as a learning exercise.
This library differentiates itself from other modern regular expression engines by being implemented entirely as a deterministic finite automaton (DFA), for better or worse. Engines driven by an underlying backtracking NFA tend to be quicker to set up because their construction is relatively cheap. However, this comes at a cost: for some combinations of pattern and input, backtracking degenerates into catastrophic backtracking, where matching time grows exponentially with the length of the input. The running time of a backtracking NFA is therefore harder to predict than that of a DFA. A DFA-based implementation has a high upfront construction cost, but it matches any input in time linear in the input length.
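To make the linear-time claim concrete, here is a minimal sketch of DFA matching, hand-written for the pattern col[ou]r. It is not this library's implementation: the state numbering and the names `step` and `matches_col_ou_r` are made up for illustration, the automaton is hard-coded instead of being constructed from the pattern, and it assumes whole-input matching on plain bytes.

```cpp
#include <string>

// Hand-built DFA for the pattern col[ou]r (illustration only, not the
// library's code). States:
//   0 = start, 1 = "c", 2 = "co", 3 = "col", 4 = "col[ou]",
//   5 = accept, 6 = dead (no match possible any more).
const int kAccept = 5;
const int kDead = 6;

// Transition function: exactly one successor state per (state, character).
int step(int state, char c)
{
    switch (state) {
    case 0: return c == 'c' ? 1 : kDead;
    case 1: return c == 'o' ? 2 : kDead;
    case 2: return c == 'l' ? 3 : kDead;
    case 3: return (c == 'o' || c == 'u') ? 4 : kDead;
    case 4: return c == 'r' ? kAccept : kDead;
    default: return kDead; // accept has no outgoing edges; dead stays dead
    }
}

// Whole-input match: one transition per input character, no backtracking.
bool matches_col_ou_r(const std::string& input)
{
    int state = 0;
    for (char c : input) {
        state = step(state, c);
        if (state == kDead) {
            return false; // early exit, the input can no longer match
        }
    }
    return state == kAccept;
}
```

Each input character triggers exactly one transition, so the loop runs at most once per character regardless of the pattern. For example, `matches_col_ou_r("color")` and `matches_col_ou_r("colur")` return true, while `matches_col_ou_r("colour")` returns false.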
Please check here for installation and CMake integration instructions.
Find online Doxygen documentation here.
Example code snippet:
#include <iostream>
#include <regex/Regex.hpp>
#include <string>
int main()
{
    auto regex = regex::Regex("col[ou]r");
    std::cout << std::boolalpha << regex.match("color") << std::endl;
    return 0;
}
Example code snippet with non-ASCII character code points:
#include <iostream>
#include <regex/Regex.hpp>
#include <string>
int main()
{
    auto regex = regex::Regex("[AÅÃ]");
    std::cout << std::boolalpha << regex.match("A") << std::endl;
    std::cout << std::boolalpha << regex.match("Å") << std::endl;
    std::cout << std::boolalpha << regex.match("Ã") << std::endl;
    return 0;
}
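Non-ASCII characters such as Å occupy more than one byte in UTF-8, so a matcher has to work on code points, not on bytes. The following sketch shows that decoding step under the assumption that patterns and inputs are UTF-8 encoded, which this README does not state explicitly. The function `decode_utf8` is a hypothetical helper, not part of this library, and it does not validate malformed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode well-formed UTF-8 into Unicode code points.
// Hypothetical helper for illustration; malformed input is not rejected.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> code_points;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t length = 1;   // total bytes in this sequence
        std::uint32_t cp = lead;  // ASCII: the byte is the code point
        if (lead >= 0xF0) {
            length = 4;
            cp = lead & 0x07;
        } else if (lead >= 0xE0) {
            length = 3;
            cp = lead & 0x0F;
        } else if (lead >= 0xC0) {
            length = 2;
            cp = lead & 0x1F;
        }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < length && i + k < bytes.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        code_points.push_back(cp);
        i += length;
    }
    return code_points;
}
```

With this helper, the two-byte sequence `"\xC3\x85"` (Å) decodes to the single code point U+00C5, which is what a character class like [AÅÃ] needs to compare against.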
Exceptions are thrown on illegal usage (e.g. usage of unsupported regex features):
#include <iostream>
#include <regex/Regex.hpp>
#include <string>
int main()
{
    // Anchors are not supported, so this constructor call throws.
    auto regex = regex::Regex("^Hello World!$");
    return 0;
}
Because the exception is not caught, the program terminates with:
terminate called after throwing an instance of 'std::runtime_error'
what(): Error at position 1. Message: Anchors are not supported
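To avoid the abrupt termination, the exception can be caught and its message inspected. The sketch below shows the pattern with a stand-in function so that it compiles without the library: `compile_or_throw` and `try_compile` are hypothetical names, the anchor check is deliberately naive (it would also reject a ^ inside a character class), and the message format is invented. With the real library, the same try/catch would wrap the `regex::Regex(...)` constructor call.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a pattern compiler that rejects anchors.
// Naive on purpose: it does not distinguish ^ inside a character class.
void compile_or_throw(const std::string& pattern)
{
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == '^' || pattern[i] == '$') {
            throw std::runtime_error("unsupported anchor at position " +
                                     std::to_string(i + 1));
        }
    }
}

// Returns an empty string on success, or the exception message on failure.
std::string try_compile(const std::string& pattern)
{
    try {
        compile_or_throw(pattern);
        return "";
    } catch (const std::runtime_error& e) {
        return e.what(); // report the error to the caller, do not terminate
    }
}
```

Catching `std::runtime_error` by const reference keeps the program running and lets the caller decide how to report an unsupported pattern.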
Take a look at the unit tests for more examples.
Feature support, organized after the Regular Expression Engine Comparison Chart of all mainstream regular expression engines:
Characters | ||
---|---|---|
Feature | Support | Note(s) |
Backslash escapes one metacharacter | YES | |
\Q...\E escapes a string of metacharacters | NO | |
\x00 through \xFF (ASCII character) | NO | See \u0000 through \uFFFF and \U00000000 through \U0010FFFF |
\n (LF) | YES | |
\r (CR) | YES | |
\t (tab) | YES | |
\f (form feed) | YES | |
\v (vtab) | YES | |
\a (bell) | YES | |
\e (escape) | NO | |
\b (backspace) | NO | |
\B (backslash) | NO | |
\\ (backslash) | YES | |
\cA through \cZ (control character) | NO | |
\ca through \cz (control character) | NO | |
Character Classes or Character Sets | ||
Feature | Support | Note(s) |
[abc] character class | YES | |
[^abc] negated character class | YES | |
Hyphen in [\d-z] is a literal | YES | |
Hyphen in [a-\d] is a literal | YES | |
Hyphen in [\w-\d] is a literal | YES | |
Backslash escapes one character class metacharacter | YES | |
\Q...\E escapes a string of character class metacharacters | NO | |
\d shorthand for digits | YES | Equivalent to [0-9] |
\D (negation of \d) | YES | Equivalent to [^0-9] |
\w shorthand for word characters | YES | Equivalent to [a-zA-Z0-9_] |
\W (negation of \w) | YES | Equivalent to [^a-zA-Z0-9_] |
\s shorthand for whitespace | YES | Equivalent to [ \t\r\n\f] |
\S (negation of \s) | YES | Equivalent to [^ \t\r\n\f] |
[\b] backspace | NO | |
Dot | ||
Feature | Support | Note(s) |
. (dot) | YES | Excludes new-line (\n) |
Anchors | ||
Feature | Support | Note(s) |
^ (start of string/line) | NO | |
$ (end of string/line) | NO | |
\A (start of string) | NO | |
\Z (end of string, before final line break) | NO | |
\z (end of string) | NO | |
\` (start of string) | NO | |
\' (end of string) | NO | |
Word Boundaries | ||
Feature | Support | Note(s) |
\b (at the beginning or end of a word) | NO | |
\B (NOT at the beginning or end of a word) | NO | |
\y (at the beginning or end of a word) | NO | |
\Y (NOT at the beginning or end of a word) | NO | |
\m (at the beginning of a word) | NO | |
\M (at the end of a word) | NO | |
\< (at the beginning of a word) | NO | |
\> (at the end of a word) | NO | |
Alternation | ||
Feature | Support | Note(s) |
| (alternation) | YES | |
Quantifiers | ||
Feature | Support | Note(s) |
? (0 or 1) | YES | |
* (0 or more) | YES | |
+ (1 or more) | YES | |
{n} (exactly n) | YES | |
{n,m} (between n and m) | YES | |
{n,} (n or more) | YES | |
? after any of the above quantifiers to make it "lazy" | NO | |
Grouping and Backreferences | ||
Feature | Support | Note(s) |
(regex) (numbered capturing group) | NO | Used for non-capturing groups instead (see below) |
(regex) (non-capturing group) | YES | This library uses this syntax for non-capturing groups. In contrast, most other implementations use it for capturing groups! |
(?:regex) (non-capturing group) | NO | |
\1 through \9 (backreferences) | NO | |
\10 through \99 (backreferences) | NO | |
Forward references \1 through \9 | NO | |
Nested references \1 through \9 | NO | |
Backreferences to non-existent groups are an error | NO | |
Backreferences to failed groups also fail | NO | |
Modifiers | ||
Feature | Support | Note(s) |
(?i) (case insensitive) | NO | |
(?s) (dot matches newlines) | NO | |
(?m) (^ and $ match at line breaks) | NO | |
(?x) (free-spacing mode) | NO | |
(?n) (explicit capture) | NO | |
(?-ismxn) (turn off mode modifiers) | NO | |
(?ismxn:group) (mode modifiers local to group) | NO | |
Atomic Grouping and Possessive Quantifiers | ||
Feature | Support | Note(s) |
(?>regex) (atomic group) | NO | |
?+, *+, ++ and {m,n}+ (possessive quantifiers) | NO | |
Lookaround | ||
Feature | Support | Note(s) |
(?=regex) (positive lookahead) | NO | |
(?!regex) (negative lookahead) | NO | |
(?<=text) (positive lookbehind) | NO | |
(?<!text) (negative lookbehind) | NO | |
Continuing from The Previous Match | ||
Feature | Support | Note(s) |
\G (start of match attempt) | NO | |
Conditionals | ||
Feature | Support | Note(s) |
(?(?=regex)then|else) (using any lookaround) | NO | |
(?(regex)then|else) | NO | |
(?(1)then|else) | NO | |
(?(group)then|else) | NO | |
Comments | ||
Feature | Support | Note(s) |
(?#comment) | NO | |
Free-Spacing Syntax | ||
Feature | Support | Note(s) |
Free-spacing syntax supported | NO | |
Character class is a single token | NO | |
# starts a comment | NO | |
Unicode Characters | ||
Feature | Support | Note(s) |
\X (Unicode grapheme) | NO | |
\u0000 through \uFFFF (4 digit Unicode character) | YES | |
\U00000000 through \U0010FFFF (8 digit Unicode character) | YES | |
\x{0} through \x{FFFF} (Unicode character) | NO | |
Unicode Properties, Scripts and Blocks | ||
Feature | Support | Note(s) |
\pL through \pC (Unicode properties) | NO | |
\p{L} through \p{C} (Unicode properties) | NO | |
\p{Lu} through \p{Cn} (Unicode property) | NO | |
\p{L&} and \p{Letter&} (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode properties) | NO | |
\p{IsL} through \p{IsC} (Unicode properties) | NO | |
\p{IsLu} through \p{IsCn} (Unicode property) | NO | |
\p{Letter} through \p{Other} (Unicode properties) | NO | |
\p{Lowercase_Letter} through \p{Not_Assigned} (Unicode property) | NO | |
\p{IsLetter} through \p{IsOther} (Unicode properties) | NO | |
\p{IsLowercase_Letter} through \p{IsNot_Assigned} (Unicode property) | NO | |
\p{Arabic} through \p{Yi} (Unicode script) | NO | |
\p{IsArabic} through \p{IsYi} (Unicode script) | NO | |
\p{BasicLatin} through \p{Specials} (Unicode block) | NO | |
\p{InBasicLatin} through \p{InSpecials} (Unicode block) | NO | |
\p{IsBasicLatin} through \p{IsSpecials} (Unicode block) | NO | |
Part between {} in all of the above is case insensitive | NO | |
Spaces, hyphens and underscores allowed in all long names listed above (e.g. BasicLatin can be written as Basic-Latin or Basic_Latin or Basic Latin) | NO | |
\P (negated variants of all \p as listed above) | NO | |
\p{^...} (negated variants of all \p{...} as listed above) | NO | |
Named Capture and Backreferences | ||
Feature | Support | Note(s) |
(?<name>regex) (.NET-style named capturing group) | NO | |
(?'name'regex) (.NET-style named capturing group) | NO | |
\k<name> (.NET-style named backreference) | NO | |
\k'name' (.NET-style named backreference) | NO | |
(?P<name>regex) (Python-style named capturing group) | NO | |
(?P=name) (Python-style named backreference) | NO | |
multiple capturing groups can have the same name | NO | |
XML Character Classes | ||
Feature | Support | Note(s) |
\i, \I, \c and \C shorthand XML name character classes | NO | |
[abc-[abc]] character class subtraction | NO | |
POSIX Bracket Expressions | ||
Feature | Support | Note(s) |
[:alpha:] POSIX character class | NO | |
\p{Alpha} POSIX character class | NO | |
\p{IsAlpha} POSIX character class | NO | |
[.span-ll.] POSIX collation sequence | NO | |
[=x=] POSIX character equivalence | NO | |
This repository employs the following practices to achieve a reasonable level of quality:
- Modern CMake to build, test, and install the library.
- Continuous integration that:
  - builds on GNU g++
  - builds on clang++
  - runs unit tests
  - runs static code analysis (clang-tidy)
  - runs dynamic code analysis (Valgrind)
  - analyzes unit test code coverage
  - builds with many compiler warnings enabled