perf-cpp: Access Performance Counters from C++ Applications

perf-cpp is a C++ library that provides direct access to hardware performance counters from within the application. It supports precise event counting and sampling of specific code segments, and it lets you link sampled data (e.g., memory addresses) with application-specific details (e.g., class instances).

Key Features

Count Hardware Events: Integrate performance monitoring into your application. Configure, start, and stop hardware counters to profile specific code segments.
Sampling: Leverage sampling to record performance data periodically, e.g., instruction pointers, memory addresses, access latency, branches, and more.
Customizable Event Configuration: Use built-in hardware events (e.g., cycles, instructions, cache-misses) as well as events specific to your underlying CPU. Additionally, define and use metrics (quantitative measurements such as cycles per instruction) to gain deeper insight into performance and efficiency.
Practical Examples: Jumpstart your implementation with the diverse collection of examples that demonstrate practical applications of the library.

Quick Start

Get up and running with perf-cpp in seconds:

# Clone the repository
git clone https://github.com/jmuehlig/perf-cpp.git

# Switch to the repository folder
cd perf-cpp

# Optional: Switch to the latest stable version
git checkout v0.9.0

# Build the library (in build/); -DBUILD_EXAMPLES=1 is only needed if you also want to build the examples
cmake . -B build -DBUILD_EXAMPLES=1
cmake --build build

# Optional: Build examples (in build/examples/bin)
cmake --build build --target examples

For detailed building instructions, including how to integrate perf-cpp into your CMake projects, visit our build guide.

Usage Examples

Count Hardware Events

Quickly set up hardware event monitoring:

#include <iostream>
#include <perfcpp/event_counter.h>

/// Initialize the counter
auto counters = perf::CounterDefinition{};
auto event_counter = perf::EventCounter{ counters };

/// Specify the events to count
event_counter.add({"seconds", "instructions", "cycles", "cache-misses"});

/// Run the workload
event_counter.start();
your_workload(); /// <-- Your code to profile
event_counter.stop();

/// Print the result to the console
const auto result = event_counter.result();
for (const auto& [event_name, value] : result)
{
    std::cout << event_name << ": " << value << std::endl;
}

Possible output:

seconds:      0.0955897 
instructions: 5.92087e+07
cycles:       4.70254e+08
cache-misses: 1.35633e+07

For further details, including how to count events in parallel settings, visit our guide on recording events.

Record Samples

Implement detailed sampling with control over the recorded content:

#include <iostream>
#include <perfcpp/sampler.h>

/// Create the sampler
auto counters = perf::CounterDefinition{};
auto sampler = perf::Sampler{ counters };

/// Specify when a sample is recorded: every 4000th cycle
sampler.trigger("cycles", perf::Period{4000U});

/// Specify what metadata is included in a sample: time, CPU ID, instruction pointer
sampler.values()
    .time(true)
    .cpu_id(true)
    .instruction_pointer(true);

/// Run the workload
sampler.start();
your_workload(); /// <-- Your code to profile
sampler.stop();

/// Print the samples to the console
const auto samples = sampler.result();
for (const auto& sample_record : samples)
{
    const auto time = sample_record.time().value();
    const auto cpu_id = sample_record.cpu_id().value();
    const auto instruction = sample_record.instruction_pointer().value();
    
    std::cout 
        << "Time = " << time << " | CPU = " << cpu_id
        << " | Instruction = 0x" << std::hex << instruction << std::dec
        << std::endl;
}

Possible output:

Time = 365449130714033 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449130913157 | CPU = 8 | Instruction = 0x64af7417c75c
Time = 365449131112591 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449131312005 | CPU = 8 | Instruction = 0x64af7417c75c

For further details, for example, which metrics can be included in samples, visit our sampling guide.

Advanced Examples

We include a comprehensive collection of examples demonstrating the advanced capabilities of perf-cpp, such as counting events in parallel settings and sampling memory accesses.

All code examples are available in the examples/ folder.

System Requirements

C++ Standard: Requires support for C++17 features.
CMake Version: 3.10 or higher.
Linux Kernel Version: 4.0 or newer (note that some features need a newer kernel).
perf_event_paranoid Setting: Adjust as needed to allow access to performance counters (see the section on adjusting the perf_event_paranoid value below).

Adjusting `perf_event_paranoid` Value

The perf_event_paranoid setting controls access to performance counters:

-1: No restrictions (full access).
0: Allow access to CPU-specific data, but no raw tracepoint samples.
1: Allow user and kernel-level profiling (default before Linux 4.6).
>= 2: Only user-level measurements allowed (default since Linux 4.6).

Checking the Current Value

cat /proc/sys/kernel/perf_event_paranoid

Changing the Value Temporarily

sudo sysctl -w kernel.perf_event_paranoid=-1

Note: To make this change permanent, edit /etc/sysctl.conf and add kernel.perf_event_paranoid = -1.

Contribute and Contact

We welcome contributions and feedback to make perf-cpp even better. For feature requests, feedback, or bug reports, please reach out via our issue tracker or submit a pull request.

Alternatively, you can email me: jan.muehlig@tu-dortmund.de.

Further Profiling Projects

While perf-cpp is dedicated to providing developers with clear insights into application performance, it is part of a broader ecosystem of tools that facilitate performance analysis. Below is a non-exhaustive list of some other valuable profiling projects:

PAPI offers access not only to CPU performance counters but also to a variety of other hardware components including GPUs, I/O systems, and more.
Likwid is a collection of command-line tools for benchmarking and is accompanied by an extensive wiki.
PerfEvent provides lightweight access to performance counters, facilitating streamlined performance monitoring.
Intel's Instrumentation and Tracing Technology allows applications to manage the collection of trace data effectively when used in conjunction with Intel VTune Profiler.
For those who prefer a more hands-on approach, the perf_event_open system call can be utilized directly without any wrappers.

Resources about (Perf-) Profiling

This is a non-exhaustive list of academic research papers and blog articles. Feel free to add to it, including your own work, e.g., via pull request.

Academic Papers

Blog Posts

C2C - False Sharing Detection in Linux Perf (2016)
PMU counters and profiling basics. (2018)
Detect false sharing with Data Address Profiling. (2019)
Advanced profiling topics. PEBS and LBR. (2018)
