jq startup has gotten slow due to growth in builtins and linking inefficiencies
h2suzuki opened this issue · 6 comments
Thank you for this great command jq!
Recently, I have compiled jq-1.5 (master branch HEAD) on Raspberry Pi3, and found significant slow down since jq-1.4.
Here is the information I found:
jq-1.4
Note that ./configure is done without any options. Within the build directory, I did the following commands.
# ./jq --version
jq-1.4
# time echo '{}' | strace -wc ./jq .
{}
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
17.64 0.000960 69 14 mmap2
16.66 0.000907 907 1 execve
12.36 0.000673 61 11 read
10.73 0.000584 65 9 mprotect
8.49 0.000462 77 6 1 open
6.14 0.000334 48 7 fstat64
6.01 0.000327 109 3 munmap
4.80 0.000261 44 6 close
4.74 0.000258 65 4 3 access
4.65 0.000253 42 6 lseek
2.70 0.000147 49 3 brk
1.47 0.000080 80 1 write
1.10 0.000060 60 1 ioctl
0.94 0.000051 51 1 cacheflush
0.83 0.000045 45 1 uname
0.75 0.000041 41 1 set_tls
------ ----------- ----------- --------- --------- ----------------
100.00 0.005443 75 4 total
real 0m0.038s
user 0m0.010s
sys 0m0.010s
jq-1.5 HEAD
For 1.5, ./configure --without-oniguruma is done.
# ./jq --version
jq-1.5rc2-237-gcb3d5af
# time echo '{}' | strace -wc ./jq .
{}
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
15.36 0.000950 68 14 mmap2
14.82 0.000917 102 9 brk
14.57 0.000901 901 1 execve
9.75 0.000603 55 11 read
9.41 0.000582 65 9 mprotect
7.44 0.000460 77 6 1 open
5.38 0.000333 48 7 fstat64
5.04 0.000312 104 3 munmap
4.30 0.000266 44 6 close
4.28 0.000265 66 4 3 access
4.17 0.000258 43 6 lseek
1.39 0.000086 86 1 write
0.97 0.000060 60 1 ioctl
0.86 0.000053 53 1 getcwd
0.84 0.000052 52 1 cacheflush
0.79 0.000049 49 1 uname
0.63 0.000039 39 1 set_tls
------ ----------- ----------- --------- --------- ----------------
100.00 0.006186 82 4 total
real 0m0.895s
user 0m0.830s
sys 0m0.010s
You can see the elapsed time becomes almost 1 second for this minimum JSON and 23.5 times longer than 1.4.
Here is another trace of jq-1.5.
# time echo '{}' | strace -r ./jq .
0.000000 execve("./jq", ["./jq", "."], [/* 47 vars */]) = 0
0.001837 brk(0) = 0x12f5000
0.000298 uname({sys="Linux", node="raspberry", ...}) = 0
0.000201 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
0.000242 mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76fa1000
0.000226 access("/etc/ld.so.preload", R_OK) = 0
0.000201 open("/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
0.000222 fstat64(3, {st_mode=S_IFREG|0644, st_size=42, ...}) = 0
0.000219 mmap2(NULL, 42, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x76fa0000
0.000214 close(3) = 0
0.000240 open("/usr/lib/arm-linux-gnueabihf/libarmmem.so", O_RDONLY|O_CLOEXEC) = 3
... (omitted) ...
0.000451 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f9f000
0.000328 set_tls(0x76f9f4c0, 0x76f9fba8, 0x76fa4050, 0x76f9f4c0, 0x76fa4050) = 0
0.000473 mprotect(0x76ec0000, 8192, PROT_READ) = 0
0.000299 mprotect(0x76f3f000, 4096, PROT_READ) = 0
0.000240 mprotect(0x76f5f000, 20480, PROT_READ|PROT_WRITE) = 0
0.000300 mprotect(0x76f5f000, 20480, PROT_READ|PROT_EXEC) = 0
0.000257 cacheflush(0x76f5f000, 0x76f64000, 0, 0x15, 0x7ea53dd0) = 0
0.000313 mprotect(0x76fa3000, 4096, PROT_READ) = 0
0.000226 munmap(0x76f41000, 120366) = 0
0.001042 brk(0) = 0x12f5000
0.000234 brk(0x1316000) = 0x1316000
0.000392 ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
0.000752 getcwd("/root/Downloads/jq", 4096) = 19
0.000663 open("/root/.jq", O_RDONLY) = -1 ENOENT (No such file or directory)
0.032603 brk(0x1337000) = 0x1337000
0.018625 brk(0x1358000) = 0x1358000
0.041019 brk(0x1379000) = 0x1379000
0.808403 brk(0x139a000) = 0x139a000
0.389545 brk(0x1399000) = 0x1399000
0.001031 brk(0x131e000) = 0x131e000
0.000708 fstat64(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
0.000251 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f9e000
0.000291 read(0, "{}\n", 4096) = 3
0.000443 fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), ...}) = 0
0.000235 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f9d000
0.000268 write(1, "\33[1;39m{}\33[0m\n", 14{}
) = 14
0.000272 read(0, "", 4096) = 0
0.000215 close(1) = 0
0.000164 munmap(0x76f9d000, 4096) = 0
0.000430 exit_group(0) = ?
0.000826 +++ exited with 0 +++
real 0m1.445s
user 0m0.990s
sys 0m0.020s
The major difference comes from brk() system calls after open("/root/.jq", O_RDONLY).
0.032603 brk(0x1337000) = 0x1337000
0.018625 brk(0x1358000) = 0x1358000
0.041019 brk(0x1379000) = 0x1379000
0.808403 brk(0x139a000) = 0x139a000
0.389545 brk(0x1399000) = 0x1399000
0.001031 brk(0x131e000) = 0x131e000
0.808403 and 0.389545 seconds are not taken by brk() itself as indicated by strace -wc
These are implying something heavy going on in the user space between these system calls.
jq-1.4 does not have these brk() calls at all. The problematic part is introduced by jq-1.5.
I put strace output of jq-1.4 below after open("/root/.jq", O_RDONLY)
.
... (omitted) ...
0.000165 munmap(0x76ead000, 120366) = 0
0.000882 brk(0) = 0x1213000
0.000163 brk(0x1234000) = 0x1234000
0.000590 open("/root/.jq", O_RDONLY) = -1 ENOENT (No such file or directory)
0.008392 fstat64(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
0.000301 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f0a000
0.000215 read(0, "{}\n", 4096) = 3
0.000333 ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
0.000217 fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), ...}) = 0
0.000280 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f09000
0.000241 write(1, "\33[37m{}\33[0m\n", 12{}
) = 12
0.000217 read(0, "", 4096) = 0
0.000171 close(0) = 0
0.000135 munmap(0x76f0a000, 4096) = 0
0.000497 exit_group(0) = ?
0.000587 +++ exited with 0 +++
real 0m0.043s
user 0m0.020s
sys 0m0.010s
Just to make it clear, we have much free memory, so the issue is not triggered by memory shortage.
# free -h
total used free shared buffers cached
Mem: 733M 616M 116M 55M 86M 340M
-/+ buffers/cache: 189M 543M
Swap: 370M 36M 334M
Thank you in advance!
Perhaps something in the JSON parser has started doing more allocations than it should. That's all I can think of right now. Thanks for this report!
So, I ran jq 1.5 (the release branch) through callgrind, and it looks like we're spending most of our time in... block_bind_subblock
! Though, this isn't particularly surprising (see also: #1380)
For posterity, I've attached is the call graph generated by callgrind (and kcachegrind).
As such, I don't think this is an issue, so much as it is more reason to actually deal with #1380.
As for the extra calls to brk()
, I'd expect them to be because we have so many more builtins now and they all get compiled/bound/stripped every time we compile a jq program.
@wtlangford Good job. Yeah, we need to optimize binding!
@wtlangford It would be useful to know how much more time we spend in block_bind_subblock()
called from yyparse()
versus from linking. See my commentary on #1380.
It's hard to say... kcachegrind is not my favorite tool, and this is all too fast overall for something like gprof (which I'm a bit better with).
Notes about these times:
- callgrind doesn't actually measure time. It instead measures things like "instructions fetched" and "data read". I'm using "instructions fetched" as a reference point for time. Obviously that's a bit fuzzy, but it should do.
- Percentages are relative to the the caller (unless otherwise stated).
This is all based off of an invocation of jq that looks like this:
echo '{}' | valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes --collect-jumps=yes ./jq .
Onwards to data!
Almost all of our time (99.84% of the total process time) is spent in jq_compile_args() -> builtins_bind()
Here's a breakdown of that time:
- 26.08% in
builtins_bind_one()
(injq_parse_library()
) - 46.43% in
gen_cbinding()
(in repeated calls toblock_bind()
) - 25.56% in
block_drop_unreferenced()
(inblock_count_refs()
)
All of the time spent in jq_parse_library()
is spent in yyparse()
(which makes sense, given that the program we're compiling is just .
. I'd expect a more complex program to spend time in block_bind()
)
Looking at the callers of block_bind_subblock()
yields the following breakdown of time spent in block_bind_subblock()
:
- 31.17% comes from
yyparse()
- 64.77% comes from
gen_cbinding()
Lots of numbers here, and I'm not sure how well I trust them, but the expectation is that spending less time binding/binding more efficiently will speed things up.
Thanks @wtlangford, that confirms my thinking that just going from O(N_insts * M_builtins)
to O(N_insts * log(M_builtins))
would be a big win (especially for tiny programs like .
). Most jq programs aren't that large, so that should be a big win. Precompiling builtins might be a huge win too, though that would be a significant complication.