A work-in-progress implementation of AWK on top of GraalVM.
The eventual goal of this project is compatiblity with the POSIX specification of awk, with high performance on large, well-structured input files.
This is mostly a personal "for fun" project.
I heard about how Truffle lets you write an interpreter for a language, and get a high-performance JIT runtime for "free."
At my day job (bioinformatics), I often use AWK to operate on gigabyte+ files that have a pretty regular structure. I think that a JIT could lend some really nice speedups there - automatically specializing at runtime for what fields matter to the AWK program, and the format of the fields themselves.
This code should be considered pre-alpha - don't use it for anything that matters.
This project was forked from SimpleLanguage. Most of the internals still refer to the language as SL
.
Very very simple AWK programs work, like NR > 1 { print($NF); }
.
We also have decent function support (thanks SimpleLanguage!), like:
function fib(n) {
if(n <= 1) { return 1; }
return fib(n-1) + fib(n-2);
}
1 == 1 {print(fib(NR));}
Roadmap:
- Basic math (integers only)
- User function definition and invocation
- AWK rule/action structure
- Regex support
- Partial
print
function support - Support field reads (
$1
,$2
, ...) - Partial support for built-in variables (
NR
,NF
, ...) - Field writes (
$1 = "hello"
, ...) - Built-in variable writes (
OFS = ","
, ...) -
BEGIN
andEND
patterns - Additional math operators (
x %= 2
,2^4
, ...) - Non-integer numeric types (float, double)
- Math functions (
sin
,sqrt
,rand
, ...) - String functions (
length
,sub
,tolower
, ...) - Arrays
- Input/output redirection (
getline < "myfile.txt"
, ...) -
system
function for invoking other utils - Implicit string concatentation (
"hello," "Ada" == "hello,Ada"
) - Syntax conformation (e.g: call
print
without parens, optional semicolons at end of blocks) - Implicit string->numeric casting for "numeric strings".
- Exit status code
- Command-line parameters
- Resolve syntactical differences (e.g: optional semicolons at end of blocks)
- Comb over operator precedence rules
- Locale support
- Fuzzing for errors
- Fuzzing for catching differences from other implementations
- Optimization
- More stuff that I've forgotten to list here
- If I go absolutely hog-wild - automatic parallelization?
While it is much too early to make any bold claims, initial testing on the subset of AWK currently supported indicates that graawk
can be faster (for some toy programs) than gawk
or mawk
on input sizes > 5,000,000 lines.
graawk
is definitely slower than other AWK implementations for small data, and it will probably remain so.
TBD! Eventually, I'd like to release all of this under some open-source license, but I haven't thought too hard about which specific one.
SimpleLanguage (whose code still forms the bulk of this project) is licensed under UPL. My own changes are not yet licensed, which as I understand it, means All Rights Reserved. If you're an individual, I'm probably not going to get mad at you for using it, but if you're making profit, I probably will be.
If you have opinions about licensing, let me know. :)
Disclaimer: I'm not interested in PRs or co-development at this early stage. The code is messy as heck, I've got a ton of weird ideas about where it's going, etc. If you really want to work on this, please contact me first.
That said, here are some instructions to develop or "install" the software:
- Install GraalVM, version 22. (Higher versions may work.)
- Install the necessary GraalVM plugins, like
native-image
. - Install Maven.
- Run
mvn package
- Run
./sl myprogram.awk < input.txt > output.txt
Kind of like a pterodactyl would. Or, you can pronounce it like grok, which rhymes with AWK, at least in my dialect.