jqlang/jq

Linking is O(M_binders * N_body_instructions), could be O(N' log M) or even O(N'), where N' is just unbound instructions

nicowilliams opened this issue · 9 comments

Linking is O(M_binders * N_body_instructions), could be O(N' log M) or even O(N'), where N' is just unbound instructions.

As we add builtins, linking will slow.

We've already optimized binding of global $vars defined with --arg*.

We could now optimize the binding of builtins and binders exported by modules.

There are two independent things to do about this:

  • Build a list of unbound instructions as we do any binding so that once we have that list we need only traverse it (this is where N' comes from).

  • Sort the globals to bind by name and ariness, then traverse the body (or list of unbound insts in it) to find unbound insts and then binary search the table of binders (this is where log M comes from).

    Alternatively build a hash table of the globals to get down to O(N') (assuming we really get O(1) from the hash table).

    We could build the sorted / hash tables as we compile a module.

    The cost of sorting or building a hash table has to be taken into account. For large programs it will be negligible, as it is a one-time cost, but most jq programs are small... A hash table will probably perform best in this sense, since insertion should be fast too.

    The C-coded builtins can be kept sorted manually. No need to sort it at run-time, nor hash it.

    src/builtin.jq can't be sorted because some of those builtins depend on others.

We can add inst *next_unbound and inst *prev_unbound fields to inst, but where inst values are generated we don't have the context to link them up. So such a list would have to be setup the first time we bind.

  • EDIT: But that can't be right either: the compiler binds subblocks as it goes, but we can't really make the list then because then we'll end up with a bunch of disjoint lists without a single link into any of them unless we do a bunch more work to pass a compiler context down to every function in src/compile.c.

    We can still make such a list at the end when we block_bind_library() (which gets called for each module -- because of namespacing considerations we can't have a global unbound inst list!).

    We can avoid adding a compiler context argument by using a thread-local, which... I guess we could do, but in any case, a great deal of care is needed to deal with block_take(), block_join(), BLOCK(), and so on, to preserve a coherent list of unbound instructions... Though perhaps all we need is to make sure that inst_free() removes unbound insts from the list. Food for thought!

  • EDIT: It can't be a list of unbound insts. It has to be a tree. Unless it's only for binding "global" symbols as when calling block_bind_library().

  • EDIT: It really has to be that block_bind_subblock() gets optimized. The way to do this is to make a sub-list of unbound insts per-argument closure, per-sub-function, and per-function body and top-level program. That makes sense! We'll still recurse down the parse tree, but we won't visit every inst. It's not clear that this will be faster though, as we'll be bloating inst. We should start by sorting and binary searching the binders.

Another thought is that we may eventually want to close the set of builtins for the "jq" module (as it were) or otherwise compile only the ones we need.

I think I'll do the following:

  • phase 1: use binary search of binder arrays instead of lists

    • keep block_bind_subblock() as-is
    • add a block_bind_subblock2() that takes an array (not a block) of binder insts
    • make gen_cbinding(), bind_bytecoded_builtins(), and friends, build a sorted array of binders and call block_bind_subblock2() to bind them using binary search of the binders for each unbound inst
  • phase 2: use a list of unbound insts for binding global symbols

    • add a block_gather_unbound() function to make a list of unbound insts
    • add a block_bind_unbounds() function that works like block_bind_subblock2() but takes the unbound inst list, and make gen_cbinding(), bind_bytecoded_builtins(), and friends use it
  • phase 3, maybe: build the unbound inst list (one per-body, subfn, and arg)

    • remove block_bind_subblock()
    • make block_bind_subblock2() use unbound inst lists (insts with subfns and args have to appear on the unbound list, even if they are not unbound)

I like the idea of compiling only those we need, but it may be difficult to identify those at runtime. Symbols that reach the end of compilation/binding and still haven't been bound should be builtins. But builtins can also bind against other builtins, so, do we keep digging into some builtin list each time we find a symbol we don't know? The nice thing about builtins right now is that they're just a jq source file...

Closing the jq (builtins) module and possibly even migrating some builtins out of the jq module and into other built-in modules might also be a good idea, though it is the sort of thing that breaks backwards-compatibility.

Food for thought...

Another possibility is pre-byte-compiling the builtins. That would require a serialization format (which would have to preserve binding information). It would also require a two-pass build, first to build a bootstrap jq, and then to build jq.

Came across this after observing some apparently-quadratic linking time in the context of trying to speed up jq "startup".

To provide some context, as it turns out, jq takes quite a bit of of time to do absolutely nothing:

% time (repeat 100 jq -n '' > /dev/null)
( repeat 100; do; jq -n '' > /dev/null; done; )  5.20s user 0.06s system 95% cpu 5.499 total
% time (repeat 100 jq -n 'empty')
( repeat 100; do; jq -n 'empty'; done; )  5.28s user 0.08s system 96% cpu 5.579 total

# Baseline comparison
% time (repeat 100 /bin/true)
( repeat 100; do; /bin/true; done; )  0.00s user 0.03s system 26% cpu 0.120 total

This matters to me because I found myself using jq in a shell script which invokes jq ~3k times, on separate files, which means actually spending several cpu-minutes (which is still like, half a minute) waiting for jq, and as it turns out, this is the majority of the time jq spends when doing just about anything.

So an strace shows that it's not doing anything strange on that from after some instrumenting I concluded that (a) most of the time is spent loading builtins, and (b) most of that time is spent in block_bind_subblock(), which concurs with #1411.

So I tried ripping out all the jq-defined builtins:

% echo > src/builtin.jq
% make
... output ...
% time (repeat 100 ./jq -n empty)
( repeat 100; do; ./jq -n empty; done; )  0.00s user 0.03s system 13% cpu 0.271 total

... which suggests that we really can get down to, if not on par with /bin/true, at least down to that order of magnitude, along with the perl/sed/awk crew (which seems like where we want to be if we're positioned as "sed for JSON data"), instead of halfway between python2 and GHC:

% time (repeat 100 lua -e '')
( repeat 100; do; lua -e ''; done; )  0.02s user 0.01s system 17% cpu 0.207 total
% time (repeat 100 perl -e '')
( repeat 100; do; perl -e ''; done; )  0.01s user 0.03s system 14% cpu 0.248 total
% time (repeat 100 sed '' < /dev/null)
( repeat 100; do; sed '' < /dev/null; done; )  0.01s user 0.03s system 19% cpu 0.183 total
% time (repeat 100 awk '' < /dev/null)
( repeat 100; do; awk '' < /dev/null; done; )  0.01s user 0.03s system 12% cpu 0.297 total
% time (repeat 100 bash -c '')
( repeat 100; do; bash -c ''; done; )  0.01s user 0.03s system 17% cpu 0.210 total
% time (repeat 100 python2 -c '')
( repeat 100; do; python2 -c ''; done; )  1.04s user 0.29s system 86% cpu 1.544 total
% time (repeat 100 python3 -c '')
( repeat 100; do; python3 -c ''; done; )  2.72s user 0.40s system 95% cpu 3.262 total
% time (repeat 100 ghci -e '')
( repeat 100; do; ghci -e ''; done; )  15.44s user 2.90s system 97% cpu 18.822 total

Stepping back a bit. I confirmed that library linking was indeed as quadratic as it looked on inspection:

% jq -nr 'range(1600)|"def x\(.): \(.);"' > a.jq
% jq -nr 'range(3200)|"def x\(.): \(.);"' > b.jq
% jq -nr 'range(6400)|"def x\(.): \(.);"' > c.jq
% time jq -n 'include "a"; empty'
jq -n 'include "a"; empty'  0.18s user 0.00s system 97% cpu 0.180 total
% time jq -n 'include "b"; empty'
jq -n 'include "b"; empty'  0.51s user 0.00s system 99% cpu 0.510 total
% time jq -n 'include "c"; empty'
jq -n 'include "c"; empty'  1.85s user 0.00s system 99% cpu 1.857 total

... but this didn't jive with what I've seen with generated jq programs that contain hundreds of function definitions, which don't take nearly as long to run:

% time jq -n "empty"
jq -n "empty"  0.05s user 0.00s system 93% cpu 0.056 total
% time jq -n "$(cat a.jq) empty"
jq -n "$(cat a.jq) empty"  0.06s user 0.00s system 91% cpu 0.058 total
% time jq -n "$(cat b.jq) empty"
jq -n "$(cat b.jq) empty"  0.05s user 0.01s system 99% cpu 0.058 total
% time jq -n "$(cat c.jq) empty"
jq -n "$(cat c.jq) empty"  0.06s user 0.00s system 98% cpu 0.065 total

# with the no-builtins executable
% time ./jq -n "empty"
./jq -n "empty"  0.00s user 0.00s system 0% cpu 0.003 total
% time ./jq -n "$(cat a.jq) empty"
./jq -n "$(cat a.jq) empty"  0.00s user 0.00s system 63% cpu 0.006 total
% time ./jq -n "$(cat b.jq) empty"
./jq -n "$(cat b.jq) empty"  0.00s user 0.00s system 91% cpu 0.009 total
% time ./jq -n "$(cat c.jq) empty"
./jq -n "$(cat c.jq) empty"  0.02s user 0.00s system 97% cpu 0.016 tota

(We hit ARG_MAX beyond ~7k defs, and the parser bails after 9993 definitions anyway, but this looks O(n) to me and even if it isn't it's Fast Enough.)

This suggests a possible stupid optimization:

% git checkout src/builtin.jq
% time (repeat 100 ./jq -n "$(cat src/builtin.jq) empty" > /dev/null)
( repeat 100; do; ./jq -n "$(cat src/builtin.jq) empty" > /dev/null; done; )  0.38s user 0.13s system 66% cpu 0.762 total

Maybe it would be useful to provide a "don't include builtin.jq" flag? Almost all the jq-defined builtins are definable using documented parts of the jq language, except for *_by (which require their _impl counterparts), indices (depends on _strindices), and match and test (depend on _match_impl).

_input should probably just be exported as input instead of having def input: _input; ...

@muhmuhten It might be possible to apply something of a bandaid... Basically, change builtins_bind() and block_bind_library() to first scan the program for all the unbound symbols and make a jv_object() indexed by those, then when stepping through all the binders, quickly skip each binder if it's not in that object.

I think I'll just add first_unbound, next_unbound, prev_unbound, and last_unbound fields to struct inst, and link into that list all unbound symbols in a block as well as all insts in the block's first..last that have non-empty subfn or arglist. Then in block_bind_subblock() walk body.first_unbound, not body.first, and as each thing is bound, remove it from the first_unbound..last_unbound list.

Linking is now fast again thanks to @muhmuhten's various contributions, especially #1834. Block binding is still quadratic, but jq startup times are now remarkably fast -- as fast as they were back in 1.5.