gawk-aregex

Gawk extension for approximate (fuzzy) regex matching, using the TRE regex library (also here).

Provides an amatch() function, roughly equivalent to the built-in match() function in gawk. For documentation of this function and example usage, please see the man page.

As of 2019-01-18, this gawk extension is incorporated into the combined Gawkextlib project, and appears here.

Installation

(Install gawk, version 4.2+)
Install gawkextlib (AUR, Fedora). Tested with version 1.0.4.
Install TRE (Arch, Fedora). Tested with version 0.8.0.
Make sure the libraries (libgawkextlib.so and libtre.so) and header files (gawkapi.h and tre/tre.h) can be found by the compiler. Add -Ldir and -Idir arguments to gcc if needed.
Compile with make
Test with make check
Set install location PREFIX in Makefile. Install with make install
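
If it is unclear which directory to pass to -I or -L, a small shell helper can report where a header or library lives. This is only a sketch: first_dir_with is a hypothetical helper, not part of this project, and the candidate directories are assumptions that will differ between systems.

```shell
# first_dir_with FILE DIR...
# Print the first DIR that contains FILE; return non-zero if none does.
# (Hypothetical helper, not part of gawk-aregex.)
first_dir_with() {
  file=$1
  shift
  for dir in "$@"; do
    if [ -e "$dir/$file" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Example: the candidate directories are assumptions; adjust for your system.
if inc=$(first_dir_with tre/tre.h /usr/include /usr/local/include /opt/local/include); then
  echo "tre/tre.h found; if the compiler misses it, add -I$inc"
else
  echo "tre/tre.h not found in the listed directories"
fi
```

The same helper can be pointed at libtre.so or libgawkextlib.so with a list of library directories to find a value for -L.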

Alternative installation

The file aregex.c can be easily incorporated into the standard gawkextlib build chain:

  git clone git://git.code.sf.net/p/gawkextlib/code gawkextlib-code
  cd gawkextlib-code
  ./make_extension_directory.sh -g /.../local/bin/ -l /.../local/lib/ \
    -I aregex "Name" "email"
  cd aregex
  cp -f .../aregex.c .
  sed -i '7 i \#include "common.h"' aregex.c 
  ./configure # --prefix=/.../local/ 
  sed -i 's/-lgawkextlib/-lgawkextlib -ltre/g' Makefile
  make
  make install

The Makefile in sf_build will do this.

Windows users

There is an up-to-date (version 5+) gawk for Windows at the ezwinports project. I looked briefly into cross-compiling aregex.c with i686-w64-mingw32, but this would require also cross-compiling TRE and gawkextlib, which is beyond me (given time available). Windows users will need to locate a Linux or Mac machine.

A note on bytes and characters

While the amatch() function is roughly equivalent to the gawk match() function, I chose not to return [i,"start"] position and [i,"length"] in the returned substring array (e.g., see here), but to return just the literal substring for each parenthetical match. Gawk is multibyte aware, and match() works in terms of characters, not bytes, but TRE seems not to be character-based. Using the wchar_t versions of tre_regcomp() and tre_regaexec() does not help if the input is a mix of single and multi-byte characters.

If positions and lengths of the substrings are needed, a simple routine can be run on the output array (out):

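  # Assumes out[0] holds the whole match and out[1..n] the parenthetical
  # substrings, in left-to-right, non-nested order.
  print "i", "substring", "posn", "length"
  p = 1  # position in str from which to search for the next substring
  for (i = 1; i < length(out); i++) {
    idx = index(substr(str, p), out[i])
    len = length(out[i])
    print i, out[i], idx+p-1, len
    p = p + idx + len - 1  # first character after this substring
  }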
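
To try the idea without the extension loaded, the routine can be wrapped in a shell function that takes the string and the substrings as arguments. This is a minimal sketch, not part of the extension: locate_groups is a hypothetical name, the substrings stand in for out[1], out[2], and so on, and it assumes they occur in left-to-right, non-nested order. A substring that is not found is reported with position 0 and length 0.

```shell
# locate_groups STR SUBSTRING...
# For each SUBSTRING, taken in left-to-right order, print its index,
# text, start position and length within STR (0 0 if it is not found).
# (Hypothetical helper; the substrings stand in for out[1], out[2], ...)
locate_groups() {
  str=$1
  shift
  awk -v str="$str" 'BEGIN {
    p = 1                                   # where the next search starts
    for (i = 1; i < ARGC; i++) {
      idx = index(substr(str, p), ARGV[i])  # offset within the remainder
      if (idx == 0) { print i, ARGV[i], 0, 0; continue }
      len = length(ARGV[i])
      print i, ARGV[i], p + idx - 1, len
      p = p + idx + len - 1                 # first character after the hit
    }
  }' "$@"
}

locate_groups "abcabc" "abc" "abc"
```

For the call shown, this prints "1 abc 1 3" and then "2 abc 4 3". Because the search start p moves past each located substring, the second "abc" is found at position 4 rather than at position 1 again.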

Thanks to...

Ville Laurikari (@laurikari) for TRE
Arnold Robbins for maintaining Gawk
gawkextlib developers, and the developers of other extensions for their examples
Benjamin Eckel (@bhelx) for this gist.
User sashoalm on StackOverflow for this answer.
User Stefan on StackOverflow for this answer.
Jannick (Github: @jannick0) for major enhancements

Cam Webb cw@camwebb.info, 2020-09-29