Name

libsregex - A non-backtracking NFA/DFA-based Perl-compatible regex engine library for matching on large data streams

Name
Status
Syntax Supported
API
Examples
Installation
Test Suite
TODO
Author
Copyright and License
See Also

Status

This library is already quite usable and some people are already using it in production.

Nevertheless, this library is still under heavy development. The API is still in flux and may change without notice.

This is a pure C library that is designed to have zero dependencies.

No pathological regexes exist for this regex engine because it does not use a backtracking algorithm at all.
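
To illustrate why tracking a set of NFA states avoids exponential blow-up, here is a minimal sketch. It is not sregex's actual code, and the names `closure` and `match_opt_a` are hypothetical. It matches the classic troublesome pattern (a?){n}a{n} against a whole string by advancing all live states at once, so the work is proportional to input length times n.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: follow the "skip this a?" epsilon edges.
 * States 0..n-1 are the optional part, so state i may move to i+1 for free. */
static uint64_t closure(uint64_t set, int n)
{
    for (int i = 0; i < n; i++) {
        if (set & (1ULL << i)) {
            set |= 1ULL << (i + 1);
        }
    }
    return set;
}

/* Whole-string match of (a?){n}a{n}, for 0 <= n <= 31.
 * Bit i of `set` means "NFA state i is live"; state 2n is the accept state. */
int match_opt_a(int n, const char *s)
{
    uint64_t accept = 1ULL << (2 * n);
    uint64_t live_mask = (accept << 1) - 1;   /* bits 0..2n */
    uint64_t set = closure(1ULL, n);

    for (; *s != '\0'; s++) {
        if (*s != 'a') {
            return 0;
        }
        /* every live state consumes one 'a' and moves one step forward */
        set = closure((set << 1) & live_mask, n);
        if (set == 0) {
            return 0;   /* no live state left: input is too long */
        }
    }
    return (set & accept) != 0;
}
```

Each input byte costs one shift plus one closure pass, no matter how many ways the optional parts could be assigned. A backtracking matcher would explore those assignments one by one.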

The code base of Russ Cox's re1 library has already been rewritten in the nginx coding style (yes, I love it!), and a clone of the nginx memory pool has been incorporated for memory management.

The Thompson and Pike VM backends have already been ported to sregex. The former is just for yes-or-no matching, and the latter also supports sub-match capturing.

Case-insensitive matching is implemented and is enabled via the SRE_REGEX_CASELESS flag.
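
As a rough picture of what caseless matching means in an octet-only engine, the sketch below compares two byte strings after folding ASCII letters. The helpers `demo_fold` and `demo_caseless_eq` are hypothetical. Folding only A-Z is an assumption made for this example, not a statement about how sregex implements the flag.

```c
#include <stddef.h>

/* Hypothetical helper: fold ASCII upper-case bytes to lower case.
 * All other octets, including bytes >= 0x80, are left unchanged. */
static unsigned char demo_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char) (c + ('a' - 'A')) : c;
}

/* Returns 1 when the first n bytes of a and b are equal ignoring ASCII case. */
int demo_caseless_eq(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (demo_fold((unsigned char) a[i]) != demo_fold((unsigned char) b[i])) {
            return 0;
        }
    }
    return 1;
}
```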

The full streaming matching API for the sregex engine has already been implemented for both the Pike and Thompson regex VMs. Sub-match capturing also supports streaming processing. When the state machine yields (that is, returns SRE_AGAIN on the current input data chunk), sregex always outputs the current value range for the $& sub-match capture in the user-supplied ovector array.
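
The yield-and-resume pattern described above can be sketched with a toy matcher that is hard-wired to the pattern [0-9]+ and fed one chunk at a time. Everything here is hypothetical and only mimics the shape of a streaming API: the `DEMO_*` codes, `demo_ctx`, `demo_init`, and `demo_feed` are not part of sregex. Offsets written to `ovector` are absolute positions in the stream, not positions within the current chunk.

```c
#include <stddef.h>

/* Hypothetical result codes, standing in for an engine's OK/AGAIN/DECLINED. */
enum { DEMO_OK = 0, DEMO_DECLINED = -1, DEMO_AGAIN = -2 };

typedef struct {
    long pos;     /* absolute stream offset of the next byte to read */
    long start;   /* start offset of the current digit run, or -1 */
} demo_ctx;

void demo_init(demo_ctx *c)
{
    c->pos = 0;
    c->start = -1;
}

/* Feed one chunk. ovector[0..1] receives the (possibly still growing)
 * range of the whole match, or -1/-1 when nothing has matched yet. */
int demo_feed(demo_ctx *c, const char *buf, size_t len, int eof, long ovector[2])
{
    for (size_t i = 0; i < len; i++, c->pos++) {
        int digit = buf[i] >= '0' && buf[i] <= '9';

        if (digit) {
            if (c->start < 0) {
                c->start = c->pos;          /* a match begins here */
            }
        } else if (c->start >= 0) {
            ovector[0] = c->start;          /* the run ended: final result */
            ovector[1] = c->pos;
            return DEMO_OK;
        }
    }

    if (c->start >= 0) {
        /* chunk ended inside a run: report the pending range so far */
        ovector[0] = c->start;
        ovector[1] = c->pos;
        return eof ? DEMO_OK : DEMO_AGAIN;
    }

    ovector[0] = ovector[1] = -1;
    return eof ? DEMO_DECLINED : DEMO_AGAIN;
}
```

Feeding "ab12" and then "34x" first yields the pending range 2..4 with `DEMO_AGAIN`, then the final range 2..6 with `DEMO_OK`. The caller never has to buffer the earlier chunk.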

Almost all the relevant test cases for PCRE 8.32 and Perl 5.16.2 have been imported into sregex's test suite and all tests are passing right now.

An API has already been implemented for assembling multiple user regexes and returning an ID indicating exactly which regex is matched (first), as well as the corresponding sub-match captures.

There is also a Just-in-Time (JIT) compiler targeting x86_64 for the Thompson VM.

Syntax Supported

The following Perl 5 regex syntax features have already been implemented.

^             match the beginning of lines
$             match the end of lines

\A            match only at beginning of stream
\z            match only at end of stream

\b            match a word boundary
\B            match except at a word boundary

.             match any char

[ab0-9]       character classes (positive)
[^ab0-9]      character classes (negative)

\d            match a digit character ([0-9])
\D            match a non-digit character ([^0-9])

\s            match a whitespace character ([ \f\n\r\t])
\S            match a non-whitespace character ([^ \f\n\r\t])

\h            match a horizontal whitespace character
\H            match a character that isn't horizontal whitespace

\v            match a vertical whitespace character
\V            match a character that isn't vertical whitespace

\w            match a "word" character ([A-Za-z0-9_])
\W            match a non-"word" character ([^A-Za-z0-9_])

\cK           control char (example: VT)

\N            match a character that isn't a newline

ab            concatenation; first match a, and then b
a|b           alternation; match a or b

(a)           capturing parentheses
(?:a)         non-capturing parentheses

a?            match 1 or 0 times, greedily
a*            match 0 or more times, greedily
a+            match 1 or more times, greedily

a??           match 1 or 0 times, not greedily
a*?           match 0 or more times, not greedily
a+?           match 1 or more times, not greedily

a{n}          match exactly n times
a{n,m}        match at least n but not more than m times, greedily
a{n,}         match at least n times, greedily

a{n}?         match exactly n times, not greedily (redundant)
a{n,m}?       match at least n but not more than m times, not greedily
a{n,}?        match at least n times, not greedily
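
The character-class shorthands listed above can be written out as plain byte predicates. This is only a sketch of the listed definitions in octet mode, with hypothetical `demo_*` names. It is not taken from the sregex source. Note that \s as listed excludes the vertical tab, unlike C's isspace().

```c
#include <stddef.h>

/* \d is [0-9] */
int demo_is_digit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* \s is [ \f\n\r\t]; the vertical tab (0x0B) is deliberately not included */
int demo_is_space(unsigned char c)
{
    return c == ' ' || c == '\f' || c == '\n' || c == '\r' || c == '\t';
}

/* \w is [A-Za-z0-9_] */
int demo_is_word(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
           || demo_is_digit(c) || c == '_';
}

/* \b holds at offset i when exactly one side of that position is a \w byte.
 * Assumption for this sketch: the buffer is the whole input, so positions
 * before the first byte and after the last byte count as non-word. */
int demo_is_word_boundary(const char *buf, size_t len, size_t i)
{
    int before = i > 0 && demo_is_word((unsigned char) buf[i - 1]);
    int after = i < len && demo_is_word((unsigned char) buf[i]);

    return before != after;
}
```

\D, \S, \W, and \B are simply the negations of these predicates.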

The following escape sequences are supported:

\t          tab
\n          newline
\r          return
\f          form feed
\a          alarm
\e          escape
\b          backspace (in character class only)
\x{}, \x00  character whose ordinal is the given hexadecimal number
\o{}, \000  character whose ordinal is the given octal number

Escaping a regex metacharacter yields the literal character itself, as in \{ (a literal brace) and \. (a literal dot).
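
For readers unfamiliar with the two hexadecimal forms, here is a small sketch of decoding \x00 and \x{} into a single octet. The functions `demo_hexval` and `demo_parse_hex_escape` are hypothetical. Two behaviours are assumptions made for this example and are not documented sregex behaviour: an escape with no digits is rejected, and so is a value above 0xFF.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit. */
static int demo_hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode an escape starting at s, which must begin with backslash-x.
 * Returns the byte value (0..255) and stores the number of characters
 * consumed in *used, or returns -1 on malformed or out-of-range input. */
int demo_parse_hex_escape(const char *s, size_t *used)
{
    if (s[0] != '\\' || s[1] != 'x') {
        return -1;
    }

    const char *p = s + 2;
    int braced = (*p == '{');
    int val = 0, ndigits = 0;

    if (braced) {
        p++;
    }

    for (;;) {
        int d = demo_hexval(*p);

        if (d < 0 || (!braced && ndigits == 2)) {
            break;              /* the unbraced form takes at most two digits */
        }
        val = val * 16 + d;
        if (val > 0xFF) {
            return -1;          /* assumption: octet mode cannot represent this */
        }
        ndigits++;
        p++;
    }

    if (ndigits == 0) {
        return -1;              /* assumption: require at least one digit */
    }
    if (braced) {
        if (*p != '}') {
            return -1;
        }
        p++;
    }

    *used = (size_t) (p - s);
    return val;
}
```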

Only the octet mode is supported; no multi-byte character encoding love (yet).

API

This library provides a pure C API. This API is still in flux and may change in the near future without notice.
