Simple wrapper for PAPI's most commonly used functions: set up events, start, stop and print counters within a region of interest. Within that region you can also define subregions in order to count them separately. The library also simplifies setting up low-level PAPI features such as domain, granularity or overflow.
PAPI's low-level API allows programmers to program different hardware counters while executing a program. However, programming those counters may add a lot of complex code to the original program, and configuring PAPI properly may be tedious. For these reasons we have created a set of macros that simplify the problem, while configuring PAPI at compilation time using either flags or files.
This interface works for multithreaded programs using OpenMP; in fact, this was the main reason for developing it. See the Usage and Options sections for further details.
You only need to rewrite your code as follows (the <papi.h> header files are already included in papi_wrapper):
#include <papi_wrapper.h>
...
pw_init_instruments; /* initialize counters */
pw_start_instruments; /* starts automatically PAPI counters */
/* region of interest (ROI) to measure */
pw_stop_instruments; /* stop counting */
pw_print_instruments; /* print results */
This way, you only need to add the header #include <papi_wrapper.h> to the source code and compile with -I/source/to/papi-wrapper papi_wrapper.c -lpapi.
Another way to use this wrapper library, using subregions, could be:
#include <papi_wrapper.h>
...
pw_init_start_instruments_sub(2); /* initialize counters */
/* region of interest (ROI) to measure */
pw_begin_subregion(0);
/* subregion 0 */
pw_end_subregion(0);
pw_begin_subregion(1);
/* subregion 1 */
pw_end_subregion(1);
pw_stop_instruments; /* stop counting */
pw_print_subregions; /* print results */
TL;DR: macros available in papi_wrapper:
- pw_set_thread_report: set the thread to measure when running single-threaded.
- pw_init_instruments: init PAPI and flush caches.
- pw_init_start_instruments: wrapper for pw_init_instruments and pw_start_instruments.
- pw_init_start_instruments_sub(n): init the library and set the number of regions to measure.
- pw_start_instruments: start counting.
- pw_stop_instruments: stop counting.
- pw_start_instruments_loop(n): to use within a parallel loop, e.g. #pragma omp parallel for.
- pw_stop_instruments_loop(n): to use within a parallel loop, e.g. #pragma omp parallel for.
- pw_begin_subregion(n): start measuring region n.
- pw_end_subregion(n): stop measuring region n.
- pw_print_instruments: print counters.
- pw_print_subregions: print counters for each subregion measured.
For more examples, refer to the tests subdirectory. They can be executed with CTest.
- PAPI >=5.x
- GCC C compiler >=8.x
- PAPI library
- Doxygen >=1.8
In order to differentiate PAPI macros from PAPI_wrapper macros, all
PAPI_wrapper options begin with the prefix PW_ as a mnemonic.
High-level configuration parameters:
- -DPW_THREAD_MONITOR: default value 0. Indicates the master thread if PW_MULTITHREAD is also enabled.
- -DPW_MULTITHREAD: disabled by default. If not defined, only PW_THREAD_MONITOR will count events (only one thread). This option is not compatible with uncore events: combining them makes the PAPI library crash. Needs to be compiled with -fopenmp.
- -DPW_VERBOSE: disabled by default. More text in the output and errors.
- -DPW_CSV: disabled by default. Print in CSV format using a comma (-DPW_CSV_SEPARATOR=",") as divider. The first row contains the thread number and the names of the hardware counters used; each following row contains one thread and its counter values.
- -DPW_FILE: print output to the file specified by -DPW_FILENAME=<file> (defaults to /tmp/__tmp_papi_wrapper.output), instead of standard output.
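To make the PW_CSV layout concrete, here is a minimal sketch in C of the format described above: a header row with a thread column followed by the counter names, then one row per thread. The helper names (csv_header, csv_row), the column label "thread" and the event names used with them are illustrative assumptions and are not part of papi_wrapper; only the "," default of PW_CSV_SEPARATOR comes from the option list.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same default as the documented option; override with -DPW_CSV_SEPARATOR. */
#ifndef PW_CSV_SEPARATOR
#define PW_CSV_SEPARATOR ","
#endif

/* Hypothetical helper: writes "thread,<name0>,<name1>,..." into buf.
 * Returns the number of characters written, or -1 if buf is too small. */
int csv_header(char *buf, size_t size, const char *const *names, int n)
{
    assert(buf != NULL && size > 0);
    int len = snprintf(buf, size, "thread");
    if (len < 0 || (size_t)len >= size) return -1;
    for (int i = 0; i < n; i++) {
        size_t left = size - (size_t)len;
        int w = snprintf(buf + len, left, "%s%s", PW_CSV_SEPARATOR, names[i]);
        if (w < 0 || (size_t)w >= left) return -1;
        len += w;
    }
    return len;
}

/* Hypothetical helper: writes "<thread>,<value0>,<value1>,..." into buf.
 * Returns the number of characters written, or -1 if buf is too small. */
int csv_row(char *buf, size_t size, int thread, const long long *values, int n)
{
    assert(buf != NULL && size > 0);
    int len = snprintf(buf, size, "%d", thread);
    if (len < 0 || (size_t)len >= size) return -1;
    for (int i = 0; i < n; i++) {
        size_t left = size - (size_t)len;
        int w = snprintf(buf + len, left, "%s%lld", PW_CSV_SEPARATOR, values[i]);
        if (w < 0 || (size_t)w >= left) return -1;
        len += w;
    }
    return len;
}
```

With two counters, the header helper yields a line such as thread,PAPI_TOT_CYC,PAPI_L1_DCM, and each thread contributes one line of the same width. The exact header text printed by the wrapper may differ.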
Low-level configuration parameters (refer to PAPI for further information):
- -DPW_GRN=<granularity>: default value PAPI_GRN_MIN.
- -DPW_DOM=<domain>: default value PAPI_DOM_ALL.
- -DPW_SAMPLING: disabled by default. Enables sampling for all the events specified in PW_FLIST with thresholds specified in PW_FSAMPLE.
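All of these -D options are ordinary preprocessor definitions. A common way to give such options compile-time defaults is one #ifndef guard per option, sketched below. This is not taken from the wrapper's source and the actual implementation may differ. The DEMO_ macro and the demo_ functions are hypothetical names. PW_GRN and PW_DOM would follow the same pattern with PAPI_GRN_MIN and PAPI_DOM_ALL from papi.h; they are left out here so that the sketch compiles without PAPI installed.

```c
#include <assert.h>
#include <string.h>

/* Valued options keep their documented default unless overridden with -D. */
#ifndef PW_THREAD_MONITOR
#define PW_THREAD_MONITOR 0
#endif

#ifndef PW_FILENAME
#define PW_FILENAME "/tmp/__tmp_papi_wrapper.output"
#endif

/* Flag-style options carry no value: they are either defined or not. */
#ifdef PW_MULTITHREAD
#define DEMO_MULTITHREAD_ENABLED 1
#else
#define DEMO_MULTITHREAD_ENABLED 0
#endif

/* Hypothetical accessors exposing the configuration resolved at compile time. */
int demo_thread_monitor(void) { return PW_THREAD_MONITOR; }
const char *demo_output_path(void) { return PW_FILENAME; }
int demo_multithread_enabled(void) { return DEMO_MULTITHREAD_ENABLED; }

/* Hypothetical sanity check: non-negative thread index, non-empty path. */
int demo_config_is_valid(void)
{
    assert(PW_FILENAME != NULL);
    return PW_THREAD_MONITOR >= 0 && strlen(PW_FILENAME) > 0;
}
```

Compiling with, for example, -DPW_THREAD_MONITOR=2 -DPW_MULTITHREAD -fopenmp would then change what the accessors return without touching the source.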
Configuration files (see their format in circleci/circleci-docs/tree/teesloane-patch-5
PAPI wrapper may be precompiled and linked to your executable or compiled
directly with your sources. PAPI wrapper basically initializes a PAPI event set
for each counter specified in PW_FLIST. If multithreading is enabled, then
all threads will count events individually and simultaneously, but one counter
at a time. Multiplexing is an experimental feature that should be avoided in
PAPI, since in PAPI's discussions there is some skepticism regarding its
reliability.
The overhead of a library can be considered in terms of execution time or memory. In either case, the sources of overhead are:
- Costs of the PAPI library: initializing the library, starting counters, stopping them. For further details refer to the papi_cost utility.
- Costs of PAPI wrapper: iterating over the list of events to measure. Each counter is measured individually, so there is no concurrency at all when it comes to measuring different events. Concurrency exists only across threads: all threads are measured at the same time.
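The wrapper cost can be modeled with a small sketch. It assumes, as one reading of "each counter is measured individually", that the region of interest is executed once per event in the list. The start and stop functions below are stand-ins that only record calls; they are not PAPI or papi_wrapper functions, and every name in the block is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*roi_fn)(void *ctx);

typedef struct {
    int starts; /* how many times a counter was started */
    int stops;  /* how many times a counter was stopped */
} demo_stats;

/* Stand-ins for starting and stopping a single hardware counter. */
static void demo_start_counter(demo_stats *s, const char *event)
{
    (void)event;
    s->starts++;
}

static void demo_stop_counter(demo_stats *s, const char *event)
{
    (void)event;
    s->stops++;
}

/* Measures every event of a NULL-terminated list, one event per ROI run.
 * Returns how many times the ROI was executed. */
int demo_measure_all(const char *const *events, roi_fn roi, void *ctx,
                     demo_stats *s)
{
    assert(events != NULL && roi != NULL && s != NULL);
    int runs = 0;
    for (size_t i = 0; events[i] != NULL; i++) {
        demo_start_counter(s, events[i]);
        roi(ctx); /* the same region runs again for the next event */
        demo_stop_counter(s, events[i]);
        runs++;
    }
    return runs;
}

/* Example ROI: increments the int that ctx points to. */
void demo_roi(void *ctx)
{
    ++*(int *)ctx;
}
```

Under this model, a list of k events costs roughly k executions of the region plus k start/stop pairs, which is why short regions are dominated by the cost of PAPI itself.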
List of known issues when testing:
- Do not mix uncore and non-uncore events in the list: undefined behavior.
- Uncore events must explicitly have the :cpu=X flag on them.
- With the major version 1.0.0, PAPI wrapper introduces subregions, which permit measuring different regions of code simultaneously and individually. Nonetheless, if the cost of the measured region or subregion is an order of magnitude lower than the cost of the PAPI library itself (refer to Overhead), the result will have too much noise.
Refer to Releases webpage. Versioning follows Semantic Versioning 2.0.0.
Maintainer:
- Marcos Horro (marcos.horro (at) udc.gal)
Authors:
- Marcos Horro
- Dr. Gabriel Rodríguez
This version is based on PolyBench, under GPLv2 license.
MIT License.