jq startup has gotten slow due to growth in builtins and linking inefficiencies

Question

jq startup has gotten slow due to growth in builtins and linking inefficiencies

h2suzuki opened this issue 7 years ago · 6 comments

h2suzuki commented 7 years ago

Thank you for this great command jq!

Recently, I have compiled jq-1.5 (master branch HEAD) on Raspberry Pi3, and found significant slow down since jq-1.4.

Here is the information I found:

jq-1.4

Note that ./configure is done without any options. Within the build directory, I did the following commands.

# ./jq --version
jq-1.4

# time echo '{}' | strace -wc ./jq .
{}
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 17.64    0.000960          69        14           mmap2
 16.66    0.000907         907         1           execve
 12.36    0.000673          61        11           read
 10.73    0.000584          65         9           mprotect
  8.49    0.000462          77         6         1 open
  6.14    0.000334          48         7           fstat64
  6.01    0.000327         109         3           munmap
  4.80    0.000261          44         6           close
  4.74    0.000258          65         4         3 access
  4.65    0.000253          42         6           lseek
  2.70    0.000147          49         3           brk
  1.47    0.000080          80         1           write
  1.10    0.000060          60         1           ioctl
  0.94    0.000051          51         1           cacheflush
  0.83    0.000045          45         1           uname
  0.75    0.000041          41         1           set_tls
------ ----------- ----------- --------- --------- ----------------
100.00    0.005443                    75         4 total

real    0m0.038s
user    0m0.010s
sys     0m0.010s

jq-1.5 HEAD

For 1.5, ./configure --without-oniguruma is done.

# ./jq --version
jq-1.5rc2-237-gcb3d5af

# time echo '{}' | strace -wc ./jq .
{}
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 15.36    0.000950          68        14           mmap2
 14.82    0.000917         102         9           brk
 14.57    0.000901         901         1           execve
  9.75    0.000603          55        11           read
  9.41    0.000582          65         9           mprotect
  7.44    0.000460          77         6         1 open
  5.38    0.000333          48         7           fstat64
  5.04    0.000312         104         3           munmap
  4.30    0.000266          44         6           close
  4.28    0.000265          66         4         3 access
  4.17    0.000258          43         6           lseek
  1.39    0.000086          86         1           write
  0.97    0.000060          60         1           ioctl
  0.86    0.000053          53         1           getcwd
  0.84    0.000052          52         1           cacheflush
  0.79    0.000049          49         1           uname
  0.63    0.000039          39         1           set_tls
------ ----------- ----------- --------- --------- ----------------
100.00    0.006186                    82         4 total

real    0m0.895s
user    0m0.830s
sys     0m0.010s

You can see the elapsed time becomes almost 1 second for this minimum JSON and 23.5 times longer than 1.4.

Here is another trace of jq-1.5.

# time echo '{}' | strace -r ./jq .
     0.000000 execve("./jq", ["./jq", "."], [/* 47 vars */]) = 0
     0.001837 brk(0)                    = 0x12f5000
     0.000298 uname({sys="Linux", node="raspberry", ...}) = 0
     0.000201 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
     0.000242 mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76fa1000
     0.000226 access("/etc/ld.so.preload", R_OK) = 0
     0.000201 open("/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
     0.000222 fstat64(3, {st_mode=S_IFREG|0644, st_size=42, ...}) = 0
     0.000219 mmap2(NULL, 42, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x76fa0000
     0.000214 close(3)                  = 0
     0.000240 open("/usr/lib/arm-linux-gnueabihf/libarmmem.so", O_RDONLY|O_CLOEXEC) = 3

... (omitted) ...

     0.000451 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f9f000
     0.000328 set_tls(0x76f9f4c0, 0x76f9fba8, 0x76fa4050, 0x76f9f4c0, 0x76fa4050) = 0
     0.000473 mprotect(0x76ec0000, 8192, PROT_READ) = 0
     0.000299 mprotect(0x76f3f000, 4096, PROT_READ) = 0
     0.000240 mprotect(0x76f5f000, 20480, PROT_READ|PROT_WRITE) = 0
     0.000300 mprotect(0x76f5f000, 20480, PROT_READ|PROT_EXEC) = 0
     0.000257 cacheflush(0x76f5f000, 0x76f64000, 0, 0x15, 0x7ea53dd0) = 0
     0.000313 mprotect(0x76fa3000, 4096, PROT_READ) = 0
     0.000226 munmap(0x76f41000, 120366) = 0
     0.001042 brk(0)                    = 0x12f5000
     0.000234 brk(0x1316000)            = 0x1316000
     0.000392 ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
     0.000752 getcwd("/root/Downloads/jq", 4096) = 19
     0.000663 open("/root/.jq", O_RDONLY) = -1 ENOENT (No such file or directory)
     0.032603 brk(0x1337000)            = 0x1337000
     0.018625 brk(0x1358000)            = 0x1358000
     0.041019 brk(0x1379000)            = 0x1379000
     0.808403 brk(0x139a000)            = 0x139a000
     0.389545 brk(0x1399000)            = 0x1399000
     0.001031 brk(0x131e000)            = 0x131e000
     0.000708 fstat64(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
     0.000251 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f9e000
     0.000291 read(0, "{}\n", 4096)     = 3
     0.000443 fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), ...}) = 0
     0.000235 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f9d000
     0.000268 write(1, "\33[1;39m{}\33[0m\n", 14{}
) = 14
     0.000272 read(0, "", 4096)         = 0
     0.000215 close(1)                  = 0
     0.000164 munmap(0x76f9d000, 4096)  = 0
     0.000430 exit_group(0)             = ?
     0.000826 +++ exited with 0 +++

real    0m1.445s
user    0m0.990s
sys     0m0.020s

The major difference comes from brk() system calls after open("/root/.jq", O_RDONLY).

     0.032603 brk(0x1337000)            = 0x1337000
     0.018625 brk(0x1358000)            = 0x1358000
     0.041019 brk(0x1379000)            = 0x1379000
     0.808403 brk(0x139a000)            = 0x139a000
     0.389545 brk(0x1399000)            = 0x1399000
     0.001031 brk(0x131e000)            = 0x131e000

0.808403 and 0.389545 seconds are not taken by brk() itself as indicated by strace -wc
These are implying something heavy going on in the user space between these system calls.

jq-1.4 does not have these brk() calls at all. The problematic part is introduced by jq-1.5.
I put strace output of jq-1.4 below after open("/root/.jq", O_RDONLY).

... (omitted) ...

     0.000165 munmap(0x76ead000, 120366) = 0
     0.000882 brk(0)                    = 0x1213000
     0.000163 brk(0x1234000)            = 0x1234000
     0.000590 open("/root/.jq", O_RDONLY) = -1 ENOENT (No such file or directory)
     0.008392 fstat64(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
     0.000301 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f0a000
     0.000215 read(0, "{}\n", 4096)     = 3
     0.000333 ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
     0.000217 fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), ...}) = 0
     0.000280 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f09000
     0.000241 write(1, "\33[37m{}\33[0m\n", 12{}
) = 12
     0.000217 read(0, "", 4096)         = 0
     0.000171 close(0)                  = 0
     0.000135 munmap(0x76f0a000, 4096)  = 0
     0.000497 exit_group(0)             = ?
     0.000587 +++ exited with 0 +++

real    0m0.043s
user    0m0.020s
sys     0m0.010s

Just to make it clear, we have much free memory, so the issue is not triggered by memory shortage.

# free -h
             total       used       free     shared    buffers     cached
Mem:          733M       616M       116M        55M        86M       340M
-/+ buffers/cache:       189M       543M
Swap:         370M        36M       334M

Thank you in advance!

Answer 1 · 2017-05-17T16:14:11.000Z

Perhaps something in the JSON parser has started doing more allocations than it should. That's all I can think of right now. Thanks for this report!

Answer 2 · 2017-05-20T11:19:23.000Z

So, I ran jq 1.5 (the release branch) through callgrind, and it looks like we're spending most of our time in... block_bind_subblock! Though, this isn't particularly surprising (see also: #1380)

For posterity, I've attached is the call graph generated by callgrind (and kcachegrind).

As such, I don't think this is an issue, so much as it is more reason to actually deal with #1380.

As for the extra calls to brk(), I'd expect them to be because we have so many more builtins now and they all get compiled/bound/stripped every time we compile a jq program.

Answer 3 · 2017-05-20T17:56:06.000Z

@wtlangford Good job. Yeah, we need to optimize binding!

Answer 4 · 2017-05-21T06:08:36.000Z

@wtlangford It would be useful to know how much more time we spend in block_bind_subblock() called from yyparse() versus from linking. See my commentary on #1380.

Answer 5 · 2017-05-21T11:47:54.000Z

It's hard to say... kcachegrind is not my favorite tool, and this is all too fast overall for something like gprof (which I'm a bit better with).

Notes about these times:

callgrind doesn't actually measure time. It instead measures things like "instructions fetched" and "data read". I'm using "instructions fetched" as a reference point for time. Obviously that's a bit fuzzy, but it should do.
Percentages are relative to the the caller (unless otherwise stated).

This is all based off of an invocation of jq that looks like this:

echo '{}' | valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes --collect-jumps=yes ./jq .

Onwards to data!

Almost all of our time (99.84% of the total process time) is spent in jq_compile_args() -> builtins_bind()
Here's a breakdown of that time:

26.08% in builtins_bind_one() (in jq_parse_library())
46.43% in gen_cbinding() (in repeated calls to block_bind())
25.56% in block_drop_unreferenced() (in block_count_refs())

All of the time spent in jq_parse_library() is spent in yyparse() (which makes sense, given that the program we're compiling is just .. I'd expect a more complex program to spend time in block_bind())

Looking at the callers of block_bind_subblock()yields the following breakdown of time spent in block_bind_subblock():

31.17% comes from yyparse()
64.77% comes from gen_cbinding()

Lots of numbers here, and I'm not sure how well I trust them, but the expectation is that spending less time binding/binding more efficiently will speed things up.

Answer 6 · 2017-05-21T18:28:38.000Z

Thanks @wtlangford, that confirms my thinking that just going from O(N_insts * M_builtins) to O(N_insts * log(M_builtins)) would be a big win (especially for tiny programs like .). Most jq programs aren't that large, so that should be a big win. Precompiling builtins might be a huge win too, though that would be a significant complication.