/Azuki

A regular expression engine implemented with virtual machine approach. Python wrapper provided.

Primary LanguageC++MIT LicenseMIT

Azuki

license

Azuki is a simple regular expression engine implemented with virtual machine approach.
Azuki是一个简单的、基于虚拟机方法实现的正则表达式引擎。

The name Azuki is Japanese アズキ(read bean).
名字Azuki来自日语アズキ(红豆)。

Introduction

Azuki is basically a C++ version of Russ Cox's re1, but supports more regex syntax and operations.

  • Boost Spirit is used to parse regular expression into syntax tree.
  • Syntax tree is compiled into program like Russ Cox's re1.
  • Nondeterministic finite automaton is simulated through “thread”s; a virtual machine runs Thompson's algorithm.
  • Submatch tracking is recorded in each thread's state.

The corresponding program for regular expression "a+b" is:

I0 CHAR 'a'
I1 SPLIT I2 I3
I2 JMP I0
I3 CHAR 'b'
I4 MATCH

Dependencies

To build Azuki, you need a compiler supporting C++17, CMake, Google Test and Boost installed.
My environment is:

  • Clang Apple LLVM version 9.0.0 (clang-900.0.39.2)
  • CMake 3.10.1
  • Google Test 1.8.0
  • Boost 1.66.0
  • Python 2.7.10

Once installed these dependencies, build Azuki and run tests with commands:

mkdir build
cd build
cmake ..
make && make test

Usage

Quick Start

Briefly speaking, we use CreateMachine to create a virtual machine from raw regular expression, then we can detect match substrings with RegexSearch, or replace them with RegexReplace.

Example 1

Detect whether input string contains pattern "a+b".

Azuki::Machine m = Azuki::CreateMachine("a+b");
Azuki::RegexSearch(m, "aab");   // true
Azuki::RegexSearch(m, "ac");    // false

Example 2

Run RegexSearch through input string to find all substrings match pattern "(a+)(b)", and print capturing groups in each substring:

Azuki::Machine m = Azuki::CreateMachine("(a+)(b)");
Azuki::MatchResult result;
while (Azuki::RegexSearch(m, "aabcdabe", result)) {
  for (auto &sub : result.capture)
    std::cout << sub << ' ';
  std::cout << std::endl;
}

The output will be:

aa b
a b

Example 3

Replace substring matches with "(a+)b" in input string with new substring specified by format string "$0c". "$0" is the 0th capturing group, which is the "a+" in "(a+)b".

Azuki::Machine m = Azuki::CreateMachine("(a+)b");
Azuki::RegexReplace(m, "daabe", "$0c");   // "daace"

Check file src/azuki.h for detailed guide.

Regex Syntax

Effect Usage Match Skip
| alternation a | b "a", "b" "c"
? match item 0 or 1 time a?b "b", "ab" "aa", "aab"
+ match item 1 or more times a+ "a", "aaa" "", "b"
* match item 0 or more times a*b "b", "aab" "ac"
^ match begin ^a "a", "abc" "b", "ba"
$ match end a$ "a", "bca" "b", "ab"
\w match any word character \w+ "a", "ab" " ", "123"
\d match any digit \d+ "123" " ", "ab"
\s match any whitespace \s+ " ", "\t" "ab", "123"
[c1-c2] match character in range [c1, c2] [a-c] "a", "c" "d", "1"
{t1,t2} match item repeating allowed times [t1, t2] a{2,3} "aa", "aaa" "a", "ba"

The {} operator also supports two variants. {t} matches item repeating exactly t tiems, {t,} matches item repeating at least t times.

Python Support

Azuki provides a python wrapper through Boost.Python.
After running make, copy pyazuki.so under build/python directory to whatever directory you like, in that directory you can use Python to play with Azuki:

import pyazuki
machine = pyazuki.CreateMachine("(a+)b")
s = "daabe"
print pyazuki.RegexReplace(machine, s, "$0c", True)

Check file python/demo.py for more examples.

Reference

TODO

  • non-capturing group
  • shorthand character classes(\d, \w, \s)
  • escape special characters('^', '$', '(', ')', etc)
  • character and numerical ranges([a-c], [1-2])
  • curly bracket quantification ({2, 5})
  • regex replace