Sampling CPU profiler for Java featuring AsyncGetCallTrace + perf_events

async-profiler

This project is a proof-of-concept implementation of a Java profiler that features the HotSpot-specific AsyncGetCallTrace API for Java call stacks, combined with perf_events for native call stacks. The general approach is receiving call stacks generated by perf_events and matching them up with call stacks generated by AsyncGetCallTrace, in order to produce an accurate profile of both Java and native code, including kernel frames.

This approach has the following advantages compared to using perf_events directly with a Java agent that translates addresses to Java method names:

  • Works on older Java versions because it doesn't require -XX:+PreserveFramePointer, which is only available in JDK 8u60 and later.

  • Does not introduce the performance overhead from -XX:+PreserveFramePointer, which can in rare cases be as high as 10%.

  • Does not require generating a map file to map Java code addresses to method names.

  • Works with interpreter frames.

  • Does not require writing out a perf.data file for further processing in user space scripts.

Because it depends on perf_events, this tool works only on Linux. Additionally, on some Linux versions, two runtime variables (perf_event_paranoid and kptr_restrict, described under Basic Usage) need to be tuned to allow non-root users to use perf_events.

Building

Make sure the JAVA_HOME environment variable points to your JDK installation, and then run make. GCC is required. After building, the profiler agent binary will be in the build subdirectory. Additionally, a small Java application that can load the agent into the target process will also be compiled to the build subdirectory.

Basic Usage

As of Linux 4.6, capturing kernel call stacks using perf_events from a non-root process requires setting two runtime variables. You can set them using sysctl or as follows:

# echo 1 > /proc/sys/kernel/perf_event_paranoid
# echo 0 > /proc/sys/kernel/kptr_restrict
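
Before attaching the agent, it can help to confirm that both variables already hold suitable values. The following is a minimal sketch in Python and is not part of async-profiler; the function name check_perf_settings and its proc_root parameter are hypothetical and exist only for this illustration. It assumes that any perf_event_paranoid value of 1 or lower is acceptable (lower values are more permissive) and that kptr_restrict must be exactly 0, matching the commands above.

```python
from pathlib import Path

# Acceptance rules derived from the echo commands above (an assumption of
# this sketch, not something the profiler itself exposes).
REQUIRED = {
    "perf_event_paranoid": lambda v: v <= 1,
    "kptr_restrict": lambda v: v == 0,
}


def check_perf_settings(proc_root="/proc/sys/kernel"):
    """Return a dict mapping each setting name to (current_value, ok)."""
    results = {}
    for name, is_ok in REQUIRED.items():
        # Each file holds a single integer followed by a newline.
        value = int(Path(proc_root, name).read_text().strip())
        results[name] = (value, is_ok(value))
    return results
```

Calling check_perf_settings() with no arguments reads the live values on a Linux machine; any entry whose second element is False points at a variable that still needs to be changed.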

To run the agent and pass commands to it, the helper script profiler.sh is provided. A typical workflow would be to launch your Java application, attach the agent and start profiling, exercise your performance scenario, and then stop profiling. The agent's output, including the profiling results, will be displayed in the Java application's standard output.

Example:

$ jps
9234 Jps
8983 Computey
$ ./profiler.sh -p 8983 -a start
$ ./profiler.sh -p 8983 -a stop

By default, the profiling frequency is 100Hz (every 10ms). Here is a sample of the output printed to the Java application's terminal:

--- Execution profile ---
Total:                   687
Unknown (native):        1 (0.15%)

Samples: 679 (98.84%)
    [ 0] Primes.isPrime
    [ 1] Primes.primesThread
    [ 2] Primes.access$000
    [ 3] Primes$1.run
    [ 4] java.lang.Thread.run

... a lot of output omitted for brevity ...

         679 (98.84%) Primes.isPrime
           4 (0.58%)  __do_softirq

... more output omitted ...

This indicates that the hottest method was Primes.isPrime, and the hottest call stack leading to it comes from Primes.primesThread.

Generating Flame Graphs

To generate flame graphs or other visualizations from the collected profiling information, you will need to dump the raw collected traces to a file and then post-process it. The following example uses Brendan Gregg's FlameGraph scripts, but a similar solution can be tailored to other visualization tools.

$ jps
9234 Jps
8983 Computey
$ ./profiler.sh -p 8983 -o interval:100000000 -a start
$ ./profiler.sh -p 8983 -f /tmp/traces.txt -a dump
$ FlameGraph/flamegraph.pl --colors=java /tmp/traces.txt > /tmp/flamegraph.svg

All Agent Options

The following is a complete list of the command-line options accepted by the agent. Some options can be combined, e.g. the profiling interval can be provided at the same time as the start command, by separating them with commas: interval:1000000,start. When using the profiler.sh helper script, start/stop/dump are actions, and the rest of the switches are provided as options. See examples below for how to use these options with the helper script, but you can also pass them directly to the agent using the jattach utility, which is built as part of this tool.

  • interval:N - sets the profiling interval, in CPU cycles. The default is 10000000 (10M) cycles. Example: ./profiler.sh -p 8983 -o interval:100000 -a start.

  • duration:N - sets the duration of the trace, in seconds. After this time, the trace will stop automatically. The default is 3600 seconds (1 hour). Example: ./profiler.sh -p 8983 -o duration:60 -a start. NOTE: This feature is not currently supported; profiling will not stop automatically.

  • frameBufferSize:N - sets the frame buffer size, in the number of Java method ids that should fit in the buffer. If you receive messages about an insufficient frame buffer size, increase this value from the default. Example: ./profiler.sh -p 8983 -o frameBufferSize:1000000 -a start.

  • start - starts profiling.

  • stop - stops profiling and prints the report.

  • dumpRawTraces:filename - stops profiling and dumps the raw trace information to the specified file. Example: ./profiler.sh -p 8983 -f /tmp/traces -a dump.
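
The comma-separated argument string described above can be assembled programmatically, for example when scripting the agent through jattach. This is a rough sketch in Python: the helper build_agent_args is hypothetical, the option and action names are the ones listed above, and placing the options before the action simply mirrors the interval:1000000,start example (whether the agent accepts other orderings is not stated here).

```python
# Options that take a numeric value, as listed above.
VALUE_OPTIONS = ("interval", "duration", "frameBufferSize")


def build_agent_args(action, dump_file=None, **options):
    """Build a string such as 'interval:1000000,start' (hypothetical helper)."""
    parts = []
    for name, value in options.items():
        if name not in VALUE_OPTIONS:
            raise ValueError(f"unknown option: {name}")
        parts.append(f"{name}:{int(value)}")

    if action == "dumpRawTraces":
        # dumpRawTraces is the only action that carries an argument.
        if not dump_file:
            raise ValueError("dumpRawTraces needs a file name")
        parts.append(f"dumpRawTraces:{dump_file}")
    elif action in ("start", "stop"):
        parts.append(action)
    else:
        raise ValueError(f"unknown action: {action}")

    return ",".join(parts)
```

For instance, build_agent_args("start", interval=1000000) produces the same string as the example in the introduction to this list.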

The format of the raw trace file is a collection of call stacks, where each line is a semicolon-separated list of frames followed by the sample count. For example:

java/lang/Thread.run;Primes$1.run;Primes.access$000;Primes.primesThread;Primes.isPrime 1056

This stack appeared 1056 times in the trace. This is the same stack trace format used by the FlameGraph script.
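
Because the format is line-oriented, it is easy to post-process without FlameGraph. The following Python sketch parses lines in the format shown above and totals samples by innermost frame; the function names parse_trace_line and hottest_leaf_frames are hypothetical, and the code assumes the sample count is always the last space-separated token on the line.

```python
from collections import Counter


def parse_trace_line(line):
    """Split 'frame;frame;... count' into (list_of_frames, count)."""
    # The count follows the last space; frame names may contain '$' or '/'.
    stack, _, count = line.rstrip().rpartition(" ")
    return stack.split(";"), int(count)


def hottest_leaf_frames(lines):
    """Sum sample counts by the last (innermost) frame of each stack."""
    totals = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        frames, count = parse_trace_line(line)
        totals[frames[-1]] += count
    # Highest sample count first.
    return totals.most_common()
```

Passing an open file, for example hottest_leaf_frames(open("/tmp/traces.txt")), yields a list of (frame, samples) pairs ordered from hottest to coldest.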

Restrictions/Limitations

  • On most Linux systems, perf_events captures call stacks with a maximum depth of 127 frames. On recent Linux kernels, this can be configured using sysctl kernel.perf_event_max_stack or by writing to the /proc/sys/kernel/perf_event_max_stack file.

  • The profiler allocates an 8 kB perf_event buffer for each thread of the target process. When running as an unprivileged user, make sure the value of /proc/sys/kernel/perf_event_mlock_kb is large enough (more than 8 * threads). Otherwise, the message "perf_event mmap failed: Operation not permitted" will be printed, and no native stack traces will be collected.

  • There is no guarantee that the perf_events overflow signal is delivered to the Java thread before any other code has run, so in rare cases the captured Java stack might not match the captured native (user+kernel) stack.

  • You will not see the non-Java frames preceding the Java frames on the stack. For example, if start_thread called JavaMain and then your Java code started running, you will not see the first two frames in the resulting stack. On the other hand, you will see non-Java frames (user and kernel) invoked by your Java code.
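
The perf_event_mlock_kb requirement in the list above is simple arithmetic, sketched here in Python. The function names are hypothetical; the 8 kB per-thread figure and the "more than 8 * threads" rule are taken from that bullet, and the thread count and limit are supplied by the caller.

```python
def required_mlock_kb(thread_count, buffer_kb=8):
    """Total perf_event buffer memory, in kB, for the given thread count."""
    return thread_count * buffer_kb


def mlock_limit_is_sufficient(thread_count, limit_kb, buffer_kb=8):
    """True if limit_kb is strictly greater than the required total."""
    # Strict comparison follows the "more than 8 * threads" wording above.
    return limit_kb > required_mlock_kb(thread_count, buffer_kb)
```

As an illustration, a target process with 64 threads needs 512 kB, so a limit of 516 kB passes the check, while 65 threads (520 kB) would not. The current limit can be read from /proc/sys/kernel/perf_event_mlock_kb.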