Gawk extension for approximate (fuzzy) regex matching, using the TRE regex library (also here).
Provides an amatch() function, roughly equivalent to the built-in match() function in gawk. For documentation of this function and example usage, please see the man page.
As of 2019-01-18, this gawk extension is now incorporated into the combined Gawkextlib project, and appears here.
- (Install gawk, version 4.2+)
- Install gawkextlib (AUR, Fedora). Tested with version 1.0.4.
- Install TRE (Arch, Fedora). Tested with version 0.8.0.
- Make sure the libraries (libgawkextlib.so and libtre.so) and header files (gawkapi.h and tre/tre.h) can be found by the compiler. Add -Ldir and -Idir arguments to gcc if needed.
- Compile with make
- Test with make check
- Set install location PREFIX in Makefile. Install with make install
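The library and header check in the steps above can be done before compiling. The following is only a minimal sketch: the helper name find_in_dirs and the search directories are assumptions for illustration, not part of this project, and your distribution may keep these files elsewhere.

```shell
#!/bin/sh
# find_in_dirs NAME DIR...: print the first DIR that contains NAME.
# Returns non-zero if NAME is in none of the directories.
# (Hypothetical helper, not part of gawk-aregex.)
find_in_dirs() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Assumed search paths; adjust for your system.
INC_DIRS="/usr/include /usr/local/include"
LIB_DIRS="/usr/lib /usr/lib64 /usr/local/lib"

# The *_DIRS variables are left unquoted on purpose so that they
# split into one argument per directory.
for h in gawkapi.h tre/tre.h; do
    if d=$(find_in_dirs "$h" $INC_DIRS); then
        echo "found $h (if gcc cannot see it, add -I$d)"
    else
        echo "missing $h in: $INC_DIRS"
    fi
done

for l in libgawkextlib.so libtre.so; do
    if d=$(find_in_dirs "$l" $LIB_DIRS); then
        echo "found $l (if gcc cannot see it, add -L$d)"
    else
        echo "missing $l in: $LIB_DIRS"
    fi
done
```

If a file is reported missing but is installed somewhere else, add that directory to the list and pass the matching -I or -L argument to gcc.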
The file aregex.c can be easily incorporated into the standard gawkextlib build chain:
git clone git://git.code.sf.net/p/gawkextlib/code gawkextlib-code
cd gawkextlib-code
./make_extension_directory.sh -g /.../local/bin/ -l /.../local/lib/ \
-I aregex "Name" "email"
cd aregex
cp -f .../aregex.c .
sed -i '7 i \#include "common.h"' aregex.c
./configure # --prefix=/.../local/
sed -i 's/-lgawkextlib/-lgawkextlib -ltre/g' Makefile
make
make install
The Makefile in sf_build will do this.
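The two sed commands in the build steps patch aregex.c and the generated Makefile in place. To preview what they do without touching real files, the same expressions can be run on sample input. This is a sketch only: it assumes GNU sed (the one-line form of the i command is a GNU extension), the function names are made up for illustration, and the sample lines are placeholders rather than the real file contents.

```shell
#!/bin/sh
# Same sed expressions as in the build steps, but reading stdin
# instead of editing files in place (no -i).

# Insert the #include before line 7 of the input (GNU sed one-line "i").
add_common_include() {
    sed '7 i \#include "common.h"'
}

# Append -ltre wherever -lgawkextlib appears.
link_tre() {
    sed 's/-lgawkextlib/-lgawkextlib -ltre/g'
}

# Hypothetical Makefile line; the generated Makefile may look different.
printf 'LIBS = -lgawkextlib\n' | link_tre

# Eight placeholder lines standing in for the top of aregex.c.
printf 'line %s\n' 1 2 3 4 5 6 7 8 | add_common_include
```

The first command prints the sample line with -ltre appended; the second prints the placeholder lines with the #include sitting between "line 6" and "line 7".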
There is an up-to-date (version 5+) gawk for Windows at the ezwinports project. I looked briefly into cross-compiling aregex.c with i686-w64-mingw32, but this would require also cross-compiling TRE and gawkextlib, which is beyond me (given time available). Windows users will need to locate a Linux or Windows machine.
While the amatch() function is roughly equivalent to the gawk match() function, I chose not to return [i,"start"] position and [i,"length"] in the returned substring array (e.g., see here), but to return just the literal substring for each parenthetical match. Gawk is multibyte aware, and match() works in terms of characters, not bytes, but TRE seems not to be character-based. Using the wchar_t versions of tre_regcomp() and tre_regaexec() does not help if the input is a mix of single and multi-byte characters.
A simple routine must be used on the output array (out), if positions and lengths of the substrings are needed:
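print "i", "substring", "posn", "length"
# p is the position in str where the next search starts
p = 1
for (i = 1; i < length(out); i++) {
    # index() is relative to the remaining text; idx+p-1 is the position in str
    idx = index(substr(str, p), out[i])
    len = length(out[i])
    print i, out[i], idx+p-1, len
    # continue just after this substring, so an adjacent match is not skipped
    p = p + idx + len - 1
}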
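The same position arithmetic can be tried without the extension loaded. In this sketch the substrings are passed on the command line and stand in for out[1], out[2], and so on; the function name substr_positions is hypothetical. It assumes the substrings are given in left-to-right order and do not come from nested groups.

```shell
#!/bin/sh
# substr_positions STR SUB...: for each SUB, print its index, text,
# position in STR and length. Each search starts just after the
# previous substring.
substr_positions() {
    awk 'BEGIN {
        str = ARGV[1]
        p = 1                      # where the next search starts
        for (i = 2; i < ARGC; i++) {
            s = ARGV[i]
            idx = index(substr(str, p), s)   # relative to position p
            len = length(s)
            print i - 1, s, idx + p - 1, len # idx+p-1 is the absolute position
            p = p + idx + len - 1            # first character after this substring
        }
    }' "$@"
}

substr_positions "the quick brown fox" "quick" "fox"
```

With plain ASCII input, positions and lengths are the same whether awk counts characters or bytes. With multibyte input, gawk counts characters in a UTF-8 locale, which is the behaviour the routine relies on.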
- Ville Laurikari (@laurikari) for TRE
- Arnold Robbins for maintaining Gawk
- gawkextlib developers, and the developers of other extensions, for their examples
- Benjamin Eckel (@bhelx) for this gist.
- user sashoalm on StackOverflow for this answer.
- user Stefan on StackOverflow for this answer.
- Jannick (GitHub: @jannick0) for major enhancements
Cam Webb cw@camwebb.info, 2020-09-29