bloomberg/pystack

Optimize core file reading

ivonastojanovic opened this issue · 0 comments

Pystack can print the stack trace of a Python core dump by running `pystack core $CORE_FILE`. While executing this command, Pystack needs to copy memory from the core file and/or several shared libraries. Reading data from a file means opening the file, seeking to the desired offset, and then reading the bytes; each of these steps is a system call, and system calls take time. Running `strace -cw -- python3 -m pystack core CORE_FILE` gives us statistics on how expensive these system calls are when Pystack processes a core file. The table below lists every system call issued during such a run, together with the number of calls and the total wall time spent in each. What we want to do is reduce the total time spent in the open/close/read system calls when operating on files (a sketch of this read pattern follows the table):

```
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 15.20    0.066126          55      1188       392 stat
 14.50    0.063094          50      1238        45 openat
 13.38    0.058219          48      1199         3 close
 11.82    0.051429          50      1015           read
  9.08    0.039498          45       863         3 lseek
  7.63    0.033201          52       632           fstat
  6.75    0.029364          57       507           mmap
  5.59    0.024325          60       405           munmap
  4.24    0.018435          48       380           fcntl
  4.20    0.018264          44       413       329 readlink
  3.31    0.014414          52       273       173 ioctl
  1.47    0.006394          96        66           write
  0.94    0.004084          60        68           rt_sigaction
  0.53    0.002298          95        24           getdents64
  0.45    0.001950          67        29           brk
  0.36    0.001555          59        26           mprotect
  0.12    0.000508          50        10           lstat
  0.09    0.000402          50         8           pread64
  0.06    0.000245         244         1           execve
  0.04    0.000185          46         4           futex
  0.04    0.000173          57         3           sigaltstack
  0.03    0.000151          75         2           getcwd
  0.03    0.000149          49         3           dup
  0.02    0.000102          50         2         1 arch_prctl
  0.02    0.000084          83         1           sysinfo
  0.02    0.000068          68         1           gettid
  0.01    0.000055          55         1           set_robust_list
  0.01    0.000053          52         1           getrandom
  0.01    0.000052          51         1         1 access
  0.01    0.000051          50         1           rt_sigprocmask
  0.01    0.000050          50         1           prlimit64
  0.01    0.000043          43         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    0.435018                  8367       947 total
```
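
For context, fetching a single range of bytes the conventional way costs several syscalls. This is an illustrative sketch, not Pystack's actual code:

```python
import os

def read_range(path, offset, size):
    """Illustrative only: fetch `size` bytes at `offset` the conventional way."""
    fd = os.open(path, os.O_RDONLY)        # openat(2)
    try:
        os.lseek(fd, offset, os.SEEK_SET)  # lseek(2)
        return os.read(fd, size)           # read(2)
    finally:
        os.close(fd)                       # close(2)
```

Even if the file is kept open across accesses, every read of a new range still pays the lseek+read pair, which together with openat/close dominates the table above.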

When a program reads data from a file, the accessed location is transferred from the disk into a buffer in memory. To reduce these system calls we want to use memory-mapped files: the idea is to map the core file directly into virtual memory pages. One of the reasons we think this is going to be faster is that accessing memory mapped by mmap doesn't require switching to kernel mode and back, so the per-access syscall cost disappears. To read more about mmap: https://man7.org/linux/man-pages/man2/mmap.2.html
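
As a rough illustration of the approach (a minimal sketch, not a proposed implementation; the class name and interface are made up for this example), the core file could be mapped once, after which every read becomes a plain slice of the mapping:

```python
import mmap
import os

class MappedFile:
    """Hypothetical helper: map a file once so later reads are memory
    accesses instead of an lseek(2)/read(2) pair per access."""

    def __init__(self, path):
        fd = os.open(path, os.O_RDONLY)
        try:
            # length=0 maps the whole file; PROT_READ since we only read.
            self._map = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
        finally:
            # The mapping stays valid after the descriptor is closed.
            os.close(fd)

    def read(self, offset, size):
        # Slicing copies bytes straight out of the mapped pages;
        # no syscall is issued (modulo page faults on first touch).
        return self._map[offset:offset + size]

    def close(self):
        self._map.close()

core = MappedFile("core.1234")  # one openat + one mmap
header = core.read(0, 64)       # no syscalls here
core.close()
```

With this pattern the openat/mmap pair happens once per file instead of a seek/read pair per access, and the kernel pages data in on demand rather than through explicit read(2) calls.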