/minrx

Minimal matcher for POSIX Extended Regular Expressions

Primary LanguageC++BSD 2-Clause "Simplified" LicenseBSD-2-Clause

MinRX: A minimalist POSIX Extended Regular Expression matcher

Description

MinRX is library for matching POSIX Extended Regular Expressions (EREs).

MinRX is written in C++ 20, but exports a C API similar to the POSIX <regex.h> functions, with C entry points for both counted and NUL-terminated strings. The MinRX function and macro names are generally constructed by prepending minrx_ or MINRX_ to the names of their POSIX <regex.h> counterparts.

My goal for MinRX is to eventually have performance competitive with the fastest extant matchers, but for now the development focus is on correctness and simplicity. When I am confident that the core algorithm is stable and well-tested, I plan to rewrite MinRX in C to improve portability and perhaps further reduce resource usage.

MinRX attempts to implement the LC_CTYPE-dependent locale features specified for POSIX regular expressions (with some limitations because MinRX can't directly access the host operating system's private locale database, and POSIX doesn't provide a portable API for some necessary queries). These features consist of the obvious support for international character sets and multibyte character encodings, together with some syntax for bracket expressions that allows many traditional uses of bracket expressions to be expressed in a way independent of character set and character encoding.

In order to fully support international locales, the core algorithms of MinRX operate on 32-bit wide characters, loosely assumed to contain Unicode code points. In UTF-8 LC_CTYPE locales, MinRX uses its own custom code to convert the input to 32-bit codepoint values. In non-UTF-8 locales, MinRX uses the C99 mbrtowc() function to transform input to its internal representation. It is implementation-dependent whether mbrtowc() translates to Unicode code points (under GNU libc it does). MinRX also has an option to force the use of native encoding in 1-byte LC_CTYPE locales.

MinRX is a non-backtracking matcher: it makes a single forward scan through the search text, one character at a time, and considers all possible matches in parallel.

MinRX runs in O(M * P) space and roughly O(M * N) time, where M is the overall regular expression length, N is the search string length, and P is the maximum parentheses nesting depth in the regular expression.

Note that MinRX does not support POSIX Basic Regular Expressions (BREs). Thanks to the inclusion of backreferences, BREs are not true regular expressions at all, and BREs with backreferences do not correspond to finite automata. BRE matchers typically have worst-case time complexity that is exponential. Although a mutant version of the MinRX algorithm could probably be constructed to match BREs, it would almost certainly have exponential worst-case space complexity as well, and so would be unlikely to be broadly useful.

A detailed description of MinRX's algorithm can be found in ALGORITHM.txt.

Features

MinRX is a nearly-feature-complete implementation of POSIX 2018 EREs, with just two caveats that I'm aware of:

  • The [. .] syntax inside bracket expressions for "collating element names" is only partially implemented. MinRX supports use of [. .] for quoting single characters, but does not support multi-character collating element names inside [. .]. As far as I know, there is no portable POSIX API for looking up the locale information that would be needed to implement this syntax.

  • The [= =] syntax inside bracket expressions for "collating element equivalence classes" is entirely unimplemented. Again, as far as I know there is no portable POSIX API that gives access to access to the necessary locale information.

MinRX also provides a few BSD extensions (\< \>) and GNU extensions (\` \' \b \B \s \S \w \W) that can be optionally enabled via flags passed at regexp compile time, as well as a few syntax compatibility options needed to enable use in GNU awk.

Installation

The build system is a mix of GNU make and meson. The included GNU Makefile provides targets compile, install, and uninstall that invoke the meson build, e.g. make PREFIX=/some/dir install.

There are two test programs rxgrep and tryit that are also built by make compile. These will print usage messages if invoked with no arguments. (These are not installed by make install).

If you don't have meson you can make rxgrep and/or make tryit to build the test programs.

make clean removes all build artifacts (both meson and traditional).

Tools

The program rxgrep is a minimal version of egrep, using MinRX for matching. It accepts the standard POSIX options for grep, but not -E or -F, as currently MinRX only supports Extended Regular Expressions. For the standard options, it also accepts the same corresponding long options as used by GNU grep, for compatibility (as far as it goes) with that program. rxgrep --help prints a short summary of the available options.

Do not have high expectations for rxgrep performance. The code was written with simplicity and correctness in mind, not speed, and the MinRX matcher itself is currently slow.

Future plans

My current development focus is correctness. I'm hoping the public will thoroughly exercise this matcher and bury me in a deluge of bug reports. :-)

At some point I will switch focus to performance and portability.

Currently planned work:

  • "First character" optimization to speed up initial search.

  • Implement POSIX 2024 non-greedy repetition operators.

  • Improve support for POSIX bracket expression collating elements and implement POSIX bracket expression equivalence classes.

  • Possibly a caching-DFA-like optimization: The MinRX SNFA automaton, with its stack of arbitrary integers, does not in general correspond to any deterministic finite automaton, but it is equivalent to a "deterministic infinite automaton", and it might be feasible to cache a useful working subset of this infinite automaton's states.

  • Rewrite in C for improved portability.

License

This code is released under the GNU Lesser General Public License (GPL), version 3 or any later version.

I want to preserve the ability to relicense this code in the future under a more-permissive license, such as BSD or LGPL-v2.1. So I will only accept contributions from contributers who agree in advance that their contributions can be relicensed under BSD or LGPL-v2.1 licenses.

The text of the GNU GPL licenses is included in the files COPYING and COPYING.LESSER.

The manual page minrx.3 (contributed by Arnold Robbins) is under its own license, the text of which can be found in that file.

Acknowledgements

Arnold Robbins pestered me for years to write this matcher, and enthusiastically tested numerous early versions of it with GNU awk. He contributed the manual page and the rxgrep program.

The meson build was contributed by shenleban tongying.

Development of this matcher also benefitted immensely from a POSIX regular expression test suite developed by Douglas McIlroy, Glenn Fowler, and others.