Sadly, this is apparently the ONLY profiler on Linux that actually works.
By "actually works," I mean that a profiler is supposed to give you a rough estimate where your program is spending the most time. Slow programs are slow for a variety of reasons, not just because of inefficient algorithms that waste CPU cycles. Lots of programs are I/O bound. If you profile such a program, you want to know where the I/O hotspots are, and you don't care where the CPU hotspots are, because they are tiny, by comparison.
As far as I can tell, after way too many hours of researching and testing every available tool, there is nothing on Linux that actually does this well. I had a simple test program, written in C++, that looped over a big file on disk, fseeking randomly and reading from the file a bunch of times. Obviously, such a program is spending almost all of its time in fseek. But not a single existing profiler would tell me this.
Not perf. Not gperftools. Not OProfile. Obviously not gprof. None of the usual suspects.
Some of them were telling me that a good portion of the running time was spent in rand (which was used to pick the random offset to fseek to). The truth was, this was only about 1% of the simple program's wall-clock runtime. If I want to speed this program up, optimizing rand is going to be a waste of time.
Valgrind's callgrind can provide better results than the others, but it makes the program run roughly 10x slower (11x in my test below), so it's useless in production environments (for example, profiling a live server that has real users connecting to it).
And it turns out that what you actually want to do is really pretty simple. So simple that some very smart people have suggested that you do it manually:
Attach to your program in gdb. Interrupt it at random a bunch of times. See what stack traces repeat more often than you'd expect them to by random chance. If 50% of your random interrupts land in the same exact stack trace, there's a good chance your program is spending a lot of time there. Some people call this "poor man's profiling."
This profiler is a simple bit of C++ code that automates this task. And since it drives GDB, the stack traces it collects are as real and informative as the ones GDB itself gives you. If your program is GDB-debuggable, this profiler will work on it.
And besides GDB and standard Linux system includes, it has no dependencies of any kind. It's also a single CPP file, so simple that it doesn't even need a build script, just:
g++ -o wallClockProfiler wallClockProfiler.cpp
You don't need to rebuild or relink your program to profile it (though building your program with debugging symbols, via -g, will make the resulting stack traces much more informative).
It can either run your target program directly, or attach to an existing process. It can sample however many times per second you want it to, and it can detach automatically after a certain number of seconds.
For example, if your server process is experiencing heavy load right now, you can attach to it for the next ten seconds, grab a few hundred stack samples, then see a nice little text report telling you exactly where your server is spending its time right now.
Overhead scales with your chosen sampling rate. If you're looking for a big problem, a relatively low sampling rate (and thus low overhead) will be sufficient to catch it. For example, with 20 samples per second, I'm seeing a 12% slowdown on my test program.
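If you're curious what that sampling loop looks like in code, here's a rough sketch written directly against GDB's batch mode. To be clear, this is NOT the actual wallClockProfiler source; the sketch's file name, its hard-coded 20-samples-per-second pause, and its crude stack parsing are all made up for illustration:

// sampleSketch.cpp -- illustration only, not wallClockProfiler itself.
// Repeatedly invokes gdb in batch mode against a target PID, grabs one
// backtrace per invocation, and counts how often each exact stack repeats.
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>
#include <unistd.h>

int main( int inNumArgs, char **inArgs ) {
    if( inNumArgs != 3 ) {
        printf( "Usage: sampleSketch pid numSamples\n" );
        return 1;
        }
    std::string pid = inArgs[1];
    int numSamples = atoi( inArgs[2] );

    std::map<std::string, int> stackCounts;

    for( int i = 0; i < numSamples; i++ ) {
        // gdb attaches, prints one backtrace, then detaches as it exits
        std::string command = "gdb -p " + pid + " -batch -ex bt 2>/dev/null";

        FILE *gdbPipe = popen( command.c_str(), "r" );
        if( gdbPipe == NULL ) {
            return 1;
            }
        std::string stack;
        char buffer[512];
        while( fgets( buffer, sizeof( buffer ), gdbPipe ) != NULL ) {
            // backtrace frame lines start with '#'
            if( buffer[0] == '#' ) {
                stack += buffer;
                }
            }
        pclose( gdbPipe );
        stackCounts[ stack ]++;

        usleep( 50000 );    // roughly 20 samples per second
        }

    // report each unique stack along with its sample count
    for( std::map<std::string, int>::iterator it = stackCounts.begin();
         it != stackCounts.end(); ++it ) {
        printf( "%d samples:\n%s\n", it->second, it->first.c_str() );
        }
    return 0;
    }

Spawning a fresh gdb for every sample like this is crude and heavy; the point is only to show the core idea of sampling stacks on a wall-clock timer and counting repeats.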
Run ./myProgram and sample the stack 20 times per second until it exits:
./wallClockProfiler 20 ./myProgram
Run ./myProgram with arguments and sample the stack 20 times per second until it exits:
./wallClockProfiler 20 "./myProgram arg1 arg2"
Attach to an existing process ./myProgram (PID 3042) and sample the stack 20 times per second for 60 seconds:
./wallClockProfiler 20 ./myProgram 3042 60
Note that in order to attach to a process, you may need to be root, like this:
sudo ./wallClockProfiler 20 ./myProgram 3042 60
Wouldn't it be nice to be able to inspect a variable in a live, running process while interrupting that process as little as possible? For example, suppose you have a live server running with clients connected, and you need to debug its current state, but you can't attach to it with an interactive debugger, because that would interrupt the process too much.
Using similar techniques to the ones used by wallClockProfiler, variablePrinter breaks your program at a specified line number and prints the value of a specified variable.
Compile it like this:
g++ -o variablePrinter variablePrinter.cpp
Run it like this:
./variablePrinter ./myProgram 7339 myProgram.cpp 153 myVariable
This will attach to PID 7339 and set a breakpoint at line 153 of myProgram.cpp. When that breakpoint is reached, the value of myVariable will be printed, and then the process will be detached.
Note that attaching to a process may require root, like this:
sudo ./variablePrinter ./myProgram 7339 myProgram.cpp 153 myVariable
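For comparison, the equivalent one-shot check done by hand with GDB's batch mode would look roughly like this (assuming the program was built with debugging symbols so the file:line breakpoint can be resolved):

sudo gdb -p 7339 -batch -ex "break myProgram.cpp:153" -ex "continue" -ex "print myVariable"

GDB hits the breakpoint, prints the value, and detaches as it exits; that interaction is essentially what variablePrinter automates and condenses into one command.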
From profiling the aforementioned test program that does a lot of random fseeks:
Sampling stack while program runs...
Sampling 20 times per second, for 50000 usec between samples
Program exited normally
313 stack samples taken
3 unique stacks sampled
Report:
81.789% =====================================
1: __kernel_vsyscall (at :-1)
2: __read_nocancel (at ../sysdeps/unix/syscall-template.S:81)
3: _IO_new_file_seekoff (at fileops.c:1079)
4: _IO_seekoff_unlocked (at ioseekoff.c:69)
5: __GI_fseek (at fseek.c:39)
6: readRandFileValue (at testProf.cpp:15)
7: main (at testProf.cpp:41)
17.572% =====================================
1: __kernel_vsyscall (at :-1)
2: __llseek (at ../sysdeps/unix/sysv/linux/llseek.c:33)
3: __GI__IO_file_seek (at fileops.c:1214)
4: _IO_new_file_seekoff (at fileops.c:1072)
5: _IO_seekoff_unlocked (at ioseekoff.c:69)
6: __GI_fseek (at fseek.c:39)
7: readRandFileValue (at testProf.cpp:15)
8: main (at testProf.cpp:41)
0.319% =====================================
1: __GI__dl_debug_state (at dl-debug.c:74)
2: dl_main (at rtld.c:2305)
3: _dl_sysdep_start (at ../elf/dl-sysdep.c:249)
4: _dl_start_final (at rtld.c:332)
5: _dl_start (at rtld.c:558)
6: _start (at :-1)
You can see that over 99% of the samples occurred down inside __GI_fseek. No existing profiler showed me anything like this.
And how do I know whether this profile is accurate? Maybe it's just as inaccurate as the others (below), just in a different way...
But here's where the rubber meets the road in terms of program execution time: by commenting out the fseek calls (and their corresponding fgetc calls) that were found above, the execution time for the test program drops from 14.6 seconds down to 0.25 seconds. In other words, these I/O calls are indeed responsible for about 98% of the execution time.
Even removing only the fseek calls (while still keeping the fgetc read calls---just reading the file sequentially 1 million times instead of randomly 1 million times), reduces the runtime from 14.6 seconds down to 0.36 seconds. Thus, 98% of our execution time is spent on those random seeks, not even on the actual file reads. Reading the data from the file isn't slow. But jumping around in a huge file is slow because of cache misses. And this profile alerts us to that fact.
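For reference, the hot loop of that test program looks roughly like this. This is a simplified sketch whose helper names match the symbols seen in the profiles; the file name, the file-size handling, and the loop count are placeholders rather than the real testProf.cpp:

#include <cstdio>
#include <cstdlib>

// Sketch of the test program.  Only the overall shape is meant to match
// the real testProf.cpp; helper names are taken from the profile above.
static int MAX = 0;    // file size in bytes, set in main

static int getRandomBoundedInt( int inLow, int inHigh ) {
    return inLow + rand() % ( inHigh - inLow + 1 );
    }

static int readRandFileValue( FILE *inFile ) {
    // jump to a random offset, then read a single byte there
    fseek( inFile, getRandomBoundedInt( 0, MAX - 1 ), SEEK_SET );
    return fgetc( inFile );
    }

int main() {
    FILE *file = fopen( "bigTestFile.bin", "rb" );    // placeholder name
    if( file == NULL ) {
        return 1;
        }
    fseek( file, 0, SEEK_END );
    MAX = (int)ftell( file );

    long sum = 0;
    for( int i = 0; i < 10000000; i++ ) {    // loop count is illustrative
        sum += readRandFileValue( file );
        }
    fclose( file );

    printf( "sum = %ld\n", sum );
    return 0;
    }

The timing comparison above came from commenting out the fseek call (and, in the first test, the fgetc call as well) in the real program's version of this loop.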
From calling:
g++ -pg -g -o testProf testProf.cpp
./testProf
gprof ./testProf
gprof -l ./testProf
Output:
Flat function profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
 55.88      0.10     0.10  10000000     9.50     9.50  getRandomBoundedInt(int, int)
 38.24      0.16     0.07  10000000     6.50    16.00  readRandFileValue(_IO_FILE*)
  5.88      0.17     0.01                              main
  0.00      0.17     0.00         1     0.00     0.00  _GLOBAL__sub_I_MAX
  0.00      0.17     0.00         1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
Flat line profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
 23.53      0.04     0.04                              readRandFileValue(_IO_FILE*) (testProf.cpp:26 @ 80486d0)
 23.53      0.08     0.04                              getRandomBoundedInt(int, int) (testProf.cpp:14 @ 8048677)
 11.76      0.10     0.02                              readRandFileValue(_IO_FILE*) (testProf.cpp:24 @ 80486b5)
 11.76      0.12     0.02                              getRandomBoundedInt(int, int) (testProf.cpp:10 @ 8048658)
 11.76      0.14     0.02                              getRandomBoundedInt(int, int) (testProf.cpp:16 @ 80486a0)
  5.88      0.15     0.01  10000000     1.00     1.00  getRandomBoundedInt(int, int) (testProf.cpp:9 @ 804864d)
  2.94      0.15     0.01  10000000     0.50     0.50  readRandFileValue(_IO_FILE*) (testProf.cpp:23 @ 80486aa)
  2.94      0.16     0.01                              getRandomBoundedInt(int, int) (testProf.cpp:17 @ 80486a8)
  2.94      0.17     0.01                              main (testProf.cpp:42 @ 8048738)
  2.94      0.17     0.01                              main (testProf.cpp:43 @ 8048742)
  0.00      0.17     0.00         1     0.00     0.00  _GLOBAL__sub_I_MAX (testProf.cpp:50 @ 80487c9)
  0.00      0.17     0.00         1     0.00     0.00  __static_initialization_and_destruction_0(int, int) (testProf.cpp:50 @ 804878c)
From calling:
perf record -g ./testProf
perf report
Output:
Samples: 59K of event 'cycles', Event count (approx.): 3556233224
+ 47.12% testProf [kernel.kallsyms] [k] __copy_to_user_ll
+ 9.35% testProf [kernel.kallsyms] [k] find_get_page
+ 5.12% testProf [kernel.kallsyms] [k] radix_tree_lookup_element
+ 3.95% testProf [kernel.kallsyms] [k] native_flush_tlb_single
+ 3.51% testProf [kernel.kallsyms] [k] sysenter_past_esp
+ 3.36% testProf [kernel.kallsyms] [k] generic_file_aio_read
+ 2.37% testProf [vdso] [.] 0x00000428
+ 2.13% testProf libc-2.19.so [.] _IO_file_seekoff@@GLIBC_2.1
+ 1.25% testProf [kernel.kallsyms] [k] ext4_llseek
+ 1.24% testProf [kernel.kallsyms] [k] fget_light
+ 1.09% testProf [kernel.kallsyms] [k] sys_llseek
+ 1.08% testProf [kernel.kallsyms] [k] _copy_to_user
+ 1.05% testProf libc-2.19.so [.] _IO_getc
+ 1.02% testProf [kernel.kallsyms] [k] common_file_perm
+ 0.99% testProf [kernel.kallsyms] [k] do_sync_read
+ 0.98% testProf testProf [.] getRandomBoundedInt(int, int)
+ 0.91% testProf [kernel.kallsyms] [k] vfs_read
+ 0.88% testProf [kernel.kallsyms] [k] current_kernel_time
+ 0.77% testProf [kernel.kallsyms] [k] fsnotify
+ 0.75% testProf [kernel.kallsyms] [k] kmap_atomic_prot
+ 0.73% testProf [kernel.kallsyms] [k] file_read_actor
+ 0.71% testProf libc-2.19.so [.] fseek
Enabling call graphs in the profile and report, like so:
perf record --call-graph fp ./testProf
perf report --call-graph --stdio
gives us a bit more information in the profile, but it's still not that useful, and still doesn't pinpoint fseek:
# Overhead Command Shared Object Sym
# ........ ........ ................. .......................................
#
46.05% testProf [kernel.kallsyms] [k] __copy_to_user_ll
|
--- __copy_to_user_ll
|
|--97.14%-- file_read_actor
| generic_file_aio_read
| do_sync_read
| vfs_read
| sys_read
| sysenter_after_call
| 0xb77b2428
| |
| --100.00%-- main
| __libc_start_main
|
|--2.48%-- _copy_to_user
| sys_llseek
| sysenter_after_call
| 0xb77b2428
--0.39%-- [...]
9.13% testProf [kernel.kallsyms] [k] find_get_page
|
--- find_get_page
|
|--99.26%-- generic_file_aio_read
| do_sync_read
| vfs_read
| sys_read
| sysenter_after_call
| 0xb77b2428
| |
| --100.00%-- main
| __libc_start_main
|
--0.74%-- do_sync_read
vfs_read
sys_read
sysenter_after_call
0xb77b2428
5.24% testProf [kernel.kallsyms] [k] radix_tree_lookup_element
|
--- radix_tree_lookup_element
|
|--99.08%-- radix_tree_lookup_slot
| find_get_page
| generic_file_aio_read
| do_sync_read
| vfs_read
| sys_read
| sysenter_after_call
| 0xb77b2428
|
--0.92%-- find_get_page
generic_file_aio_read
do_sync_read
vfs_read
sys_read
sysenter_after_call
0xb77b2428
From calling:
LD_PRELOAD=/usr/local/lib/libprofiler.so CPUPROFILE=prof.out ./testProf
pprof --text ./testProf ./prof.out
Output:
Total: 1456 samples
1339 92.0% 92.0% 1339 92.0% 0xb7737428
30 2.1% 94.0% 30 2.1% _IO_new_file_seekoff
15 1.0% 95.1% 15 1.0% _IO_getc
14 1.0% 96.0% 14 1.0% getRandomBoundedInt
8 0.5% 96.6% 8 0.5% __random_r
7 0.5% 97.0% 7 0.5% _IO_acquire_lock_fct
7 0.5% 97.5% 7 0.5% _IO_seekoff_unlocked
6 0.4% 97.9% 6 0.4% 0xb7737429
6 0.4% 98.4% 6 0.4% __GI_fseek
4 0.3% 98.6% 4 0.3% __GI__IO_file_seek
4 0.3% 98.9% 4 0.3% __llseek
4 0.3% 99.2% 25 1.7% readRandFileValue
3 0.2% 99.4% 3 0.2% __random
3 0.2% 99.6% 57 3.9% main
2 0.1% 99.7% 2 0.1% __read_nocancel
2 0.1% 99.9% 3 0.2% _init
1 0.1% 99.9% 1 0.1% 0xb773741d
1 0.1% 100.0% 1 0.1% rand
0 0.0% 100.0% 57 3.9% __libc_start_main
Here at least we see that 92% of our runtime was... down inside some mystery function. But that mystery function was called from somewhere, right? Even outputting to a full visual call graph viewer (using the --callgrind output option for pprof) still shows that 92% of our execution time was spent in a function with no callers.
Some sources suggest setting CPUPROFILE_REALTIME=1 to enable wall-clock based sampling, but I found that it made no difference in the resulting profile.
From calling:
operf ./testProf
opreport --symbols
Output:
samples % image name symbol name
180437 89.2334 no-vmlinux /no-vmlinux
5312 2.6270 libc-2.19.so _IO_file_seekoff@@GLIBC_2.1
4554 2.2521 [vdso] (tgid:6710 range:0xb7761000-0xb7761fff) [vdso] (tgid:6710 range:0xb7761000-0xb7761fff)
1961 0.9698 testProf getRandomBoundedInt(int, int)
1804 0.8922 libc-2.19.so fseek
1457 0.7205 libc-2.19.so fgetc
1030 0.5094 libc-2.19.so random_r
978 0.4837 libc-2.19.so _IO_seekoff_unlocked
842 0.4164 libc-2.19.so llseek
712 0.3521 libc-2.19.so random
687 0.3397 libc-2.19.so __x86.get_pc_thunk.bx
568 0.2809 testProf readRandFileValue(_IO_FILE*)
547 0.2705 libc-2.19.so _IO_file_seek
428 0.2117 testProf main
330 0.1632 libc-2.19.so _IO_file_read
250 0.1236 libc-2.19.so rand
232 0.1147 libc-2.19.so __read_nocancel
56 0.0277 libc-2.19.so read
8 0.0040 libc-2.19.so __uflow
5 0.0025 libc-2.19.so _IO_switch_to_get_mode
3 0.0015 libc-2.19.so _IO_default_uflow
2 9.9e-04 ld-2.19.so _dl_lookup_symbol_x
2 9.9e-04 libc-2.19.so _IO_file_underflow@@GLIBC_2.1
1 4.9e-04 ld-2.19.so dl_main
1 4.9e-04 libc-2.19.so _dl_addr
1 4.9e-04 libc-2.19.so malloc_init_state
From calling:
valgrind --tool=callgrind --collect-systime=yes ./testProf
callgrind_annotate callgrind.out.6218
Output:
--------------------------------------------------------------------------------
Ir sysCount sysTime
--------------------------------------------------------------------------------
5,380,492,114 19,999,640 50,476 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir sysCount sysTime file:function
--------------------------------------------------------------------------------
1,689,942,527 . . /build/buildd/eglibc-2.19/libio/fileops.c:_IO_
file_seekoff@@GLIBC_2.1 [/lib/i386-linux-gnu/libc-2.19.so]
430,000,000 . . /build/buildd/eglibc-2.19/libio/fseek.c:fseek
[/lib/i386-linux-gnu/libc-2.19.so]
429,999,996 . . /build/buildd/eglibc-2.19/libio/ioseekoff.c:_I
O_seekoff_unlocked [/lib/i386-linux-gnu/libc-2.19.so]
420,013,020 . . /build/buildd/eglibc-2.19/stdlib/random_r.c:ra
ndom_r [/lib/i386-linux-gnu/libc-2.19.so]
350,002,466 . . /build/buildd/eglibc-2.19/libio/getc.c:getc [/
lib/i386-linux-gnu/libc-2.19.so]
320,000,004 . . testProf.cpp:getRandomBoundedInt(int, int) [/h
ome/jasonrohrer2/checkout/wallClockProfiler/testProgram/testProf]
310,000,000 . . /build/buildd/eglibc-2.19/misc/../sysdeps/unix
/sysv/linux/llseek.c:llseek [/lib/i386-linux-gnu/libc-2.19.so]
230,000,000 . . /build/buildd/eglibc-2.19/stdlib/random.c:rand
om [/lib/i386-linux-gnu/libc-2.19.so]
220,000,008 . . testProf.cpp:readRandFileValue(_IO_FILE*) [/ho
me/jasonrohrer2/checkout/wallClockProfiler/testProgram/testProf]
180,010,670 . . /build/buildd/eglibc-2.19/string/../sysdeps/i3
86/i686/multiarch/strcat.S:0x0012694b [/lib/i386-linux-gnu/libc-2.19.so]
170,000,000 . . /build/buildd/eglibc-2.19/libio/fileops.c:_IO_
file_seek [/lib/i386-linux-gnu/libc-2.19.so]
100,000,000 . . /build/buildd/eglibc-2.19/libio/libioP.h:fseek
100,000,000 . . /build/buildd/eglibc-2.19/libio/libioP.h:getc
99,995,670 . . /build/buildd/eglibc-2.19/io/../sysdeps/unix/s
yscall-template.S:__read_nocancel [/lib/i386-linux-gnu/libc-2.19.so]
99,995,670 . . /build/buildd/eglibc-2.19/libio/fileops.c:_IO_
file_read [/lib/i386-linux-gnu/libc-2.19.so]
90,000,048 . . testProf.cpp:main [/home/jasonrohrer2/checkout
/wallClockProfiler/testProgram/testProf]
80,000,000 . . /build/buildd/eglibc-2.19/stdlib/rand.c:rand [
/lib/i386-linux-gnu/libc-2.19.so]
39,999,154 19,999,578 50,323 ???:_dl_sysinfo_int80 [/lib/i386-linux-gnu/ld-
2.19.so]
Yes! fseek is correctly at the top of the list. But the program ran 11x slower (160 seconds vs 14 seconds). Also, it's still overestimating the impact of the rand calls on the overall runtime. The KCachegrind graphical visualizer output (not shown here) pegs rand at 14% of the runtime, while only giving fseek 64%, instead of the 98% of the runtime that it actually occupies.
But at least valgrind finds fseek as the hotspot.