Flame Graphs visualize hot-CPU code-paths. Using DTrace, see: http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/ Using perf_events or SystemTap, see: http://dtrace.org/blogs/brendan/2012/03/17/linux-kernel-performance-flame-graphs/ Using XCode Instruments, see: http://schani.wordpress.com/2012/11/16/flame-graphs-for-instruments/ These can be created in three steps: 1. Capture stacks 2. Fold stacks 3. flamegraph.pl 1. Capture stacks ================= Stack samples can be captured using DTrace, perf_events or SystemTap. Using DTrace to capture 60 seconds of kernel stacks at 997 Hertz: # dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }' -o out.kern_stacks Using DTrace to capture 60 seconds of user-level stacks for PID 12345 at 97 Hertz: # dtrace -x ustackframes=100 -n 'profile-97 /PID == 12345 && arg1/ { @[ustack()] = count(); } tick-60s { exit(0); }' -o out.user_stacks Using DTrace to capture 60 seconds of user-level stacks, including while time is spent in the kernel, for PID 12345 at 97 Hertz: # dtrace -x ustackframes=100 -n 'profile-97 /PID == 12345/ { @[ustack()] = count(); } tick-60s { exit(0); }' -o out.user_stacks Switch ustack() for jstack() if the application has a ustack helper to include translated frames (eg, node.js frames; see: http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/). The rate for user-level stack collection is deliberately slower than kernel, which is especially important when using jstack() as it performs additional work to translate frames. 2. Fold stacks ============== Use the stackcollapse programs to fold stack samples into single lines. The programs provided are: - stackcollapse.pl: for DTrace stacks - stackcollapse-perf.pl: for perf_events "perf script" output - stackcollapse-stap.pl: for SystemTap stacks - stackcollapse-instruments.pl: for XCode Instruments Usage example: $ ./stackcollapse.pl out.kern_stacks > out.kern_folded The output looks like this: unix`_sys_sysenter_post_swapgs 1401 unix`_sys_sysenter_post_swapgs;genunix`close 5 unix`_sys_sysenter_post_swapgs;genunix`close;genunix`closeandsetf 85 unix`_sys_sysenter_post_swapgs;genunix`close;genunix`closeandsetf;c2audit`audit_closef 26 unix`_sys_sysenter_post_swapgs;genunix`close;genunix`closeandsetf;c2audit`audit_setf 5 unix`_sys_sysenter_post_swapgs;genunix`close;genunix`closeandsetf;genunix`audit_getstate 6 unix`_sys_sysenter_post_swapgs;genunix`close;genunix`closeandsetf;genunix`audit_unfalloc 2 unix`_sys_sysenter_post_swapgs;genunix`close;genunix`closeandsetf;genunix`closef 48 [...] 3. flamegraph.pl ================ Use flamegraph.pl to render a SVG. $ ./flamegraph.pl out.kern_folded > kernel.svg An advantage of having the folded input file (and why this is separate to flamegraph.pl) is that you can use grep for functions of interest. Eg: $ grep cpuid out.kern_folded | ./flamegraph.pl > cpuid.svg Provided Example ================ An example output from DTrace is included, both the captured stacks and the resulting Flame Graph. You can generate it yourself using: $ ./stackcollapse.pl example-stacks.txt | ./flamegraph.pl > example.svg This was from a particular performance investigation: the Flame Graph identified that CPU time was spent in the lofs module, and quantified that time. Options ======= See the USAGE message (--help) for options: USAGE: ./flamegraph.pl [options] infile > outfile.svg --titletext # change title text --width # width of image (default 1200) --height # height of each frame (default 16) --minwidth # omit smaller functions (default 0.1 pixels) --fonttype # font type (default "Verdana") --fontsize # font size (default 12) --countname # count type label (default "samples") --nametype # name type label (default "Function:") eg, ./flamegraph.pl --titletext="Flame Graph: malloc()" trace.txt > graph.svg As suggested in the example, flame graphs can process traces of any event, such as malloc()s, provided stack traces are gathered.