The libpoireau library intercepts a small fraction of calls to malloc/calloc/etc., to generate a statistically representative overview of an application's heap footprint. While the interceptor currently only tracks long-lived allocations (e.g., leaks), we plan to also implement guard pages, in the spirit of Electric Fence.
The sampling approach makes it possible to use this library in production with a minimal impact on performance (see the section on Performance overhead), and without any change to code generation, unlike, e.g., LeakSanitizer or Valgrind.
The library's implementation strategy, which offloads most of the
complexity to the kernel or an external analysis script, and only
overrides the system memory allocator (or any other allocator that
already overrides the system malloc) for the few sampled allocations,
means the instrumentation is less likely to radically change a
program's behaviour. Preloading libpoireau.so
is much less invasive
than slotting in, e.g., tcmalloc
only because one wants to debug allocations. The code base is also
much smaller, and easier to audit before dropping a new library in
production.
Finally, rather than scanning the heap for references, the
poireau.py
analysis script merely reports old allocations. For
application servers, and other workloads that expect to enter a steady
state quickly after startup, that is more useful than only reporting
unreachable objects: a slow growth in heap footprint is an issue, even
if the culprits are reachable, e.g., in a list that isn't getting
cleared when it should be.
libpoireau currently targets Linux 4.8+ (for statically defined
tracepoint support) on 64-bit platforms with 4 KB pages. Execute
make.sh to create libpoireau.so in the current directory; the code
requires a GCC-compatible C11 implementation.
Add LD_PRELOAD="$LD_PRELOAD:$path_to_libpoireau.so"
to the
environment before executing the program you wish to debug.
Before using libpoireau, we must register its static probe points
with Linux perf; this may be done before or after starting programs
with LD_PRELOAD.
sudo perf buildid-cache --add ./libpoireau.so
sudo perf list | grep poireau # should show tracepoints
We can now enable the tracepoints to generate perf events whenever
libpoireau
overrides a stdlib call.
sudo perf probe sdt_libpoireau:*
That's enough for Linux perf to report these events, e.g., in
perf top. However, that raw stream is a lot of information, and not
necessarily useful. Execute scripts/poireau.sh $PID to start
perf trace on that PID and feed the output to an allocation tracking
script. Every 10 minutes, that script will dump a list of currently
live old (> 5 minutes) sampled allocations. Send poireau.py a HUP
signal to instead get a list of all live sampled allocations. The
list of old allocations will eventually fill up with known leaks and
startup allocations; remove all current old allocations from future
reports by sending a USR1 signal to poireau.py.
A key advantage of having the analysis out of process is that we can
still provide information after a crash. Send a USR2 signal to
poireau.py to list some recent calls to free or realloc, on the off
chance that it will help debug a use-after-free.
Perf often needs sudo access, but it doesn't make sense to run all of
poireau.py as root; poireau.sh instead executes only perf with sudo.
To override the perf binary used under sudo, run
PERF=`which perf` scripts/poireau.sh ...
You may also enable system-wide tracing by invoking poireau.sh
without any argument. This is mostly useful if only one process at a
time will ever LD_PRELOAD libpoireau.so: the analysis code in
poireau.py does not currently tell processes apart when matching
allocations and frees (edit the global COMM pattern in poireau.py to
only ingest events from programs that match a certain regex).
System-wide tracing makes it easier to track events that happen
immediately on program startup.
TL;DR:
- Register libpoireau's tracepoints with perf buildid-cache --add
  and perf probe.
- Prepare your program to run with libpoireau.so instrumentation,
  e.g., with LD_PRELOAD=/path/to/libpoireau.so.
- Grab libpoireau tracepoint events by doing one of:
  a. Start the instrumented program and run
     scripts/poireau.sh $PROGRAM_PID.
  b. Edit the COMM pattern in scripts/poireau.py before running
     scripts/poireau.sh, then start the instrumented program.
- Wait for poireau.sh to report stacks for long-lived (> five
  minutes) sampled allocations, every ten minutes.
- Packages for perf can be wonky. Try to build from source and point
  poireau.sh to custom executables by setting PERF=`which perf`
  before running poireau.sh.
Interact with poireau.py with signals:
- SIGHUP: prints stacks for all live sampled allocations.
- SIGUSR1: prints stacks for all old sampled allocations and stops
  reporting them in the future.
- SIGUSR2: prints stacks for all recent calls to free or realloc.
Disable the tracepoints with
sudo perf probe --del sdt_libpoireau:*
and remove libpoireau from perf's cache with
sudo perf buildid-cache --remove ./libpoireau.so
to erase all traces of libpoireau from the perf subsystem.
If you had to edit an init script to insert the LD_PRELOAD
variable
before executing a program, it makes sense to undo the edit and
restart the instrumented program as soon as possible.
You can override the default sample rate (one sampled allocation
every 32 MB on average) by setting the POIREAU_SAMPLE_PERIOD_BYTES
environment variable to a positive sample period in bytes.
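For illustration, here is a minimal sketch of how such an
environment override might be read; the helper name and parsing
details are assumptions for this sketch, not libpoireau's actual
code.

    #include <stdlib.h>

    /* Hypothetical helper: returns the configured sample period in
     * bytes, falling back to the 32 MB default when the variable is
     * unset or invalid. */
    static size_t sample_period_bytes(void)
    {
            const char *env = getenv("POIREAU_SAMPLE_PERIOD_BYTES");

            if (env != NULL) {
                    char *end;
                    unsigned long long period = strtoull(env, &end, 10);

                    if (*end == '\0' && period > 0)
                            return (size_t)period;
            }

            return 32UL * 1024 * 1024;
    }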
Poireau can also be used to take a snapshot of live sampled
allocations (and print it) whenever the estimated heap footprint
reaches a new high water mark. Simply pass a --track-high-water-mark
argument to poireau.py; poireau.sh consumes the first argument, if
any, and passes the rest to poireau.py. For system-wide tracing,
pass * as the first argument to poireau.sh.
The poireau.py analysis script accepts a second positional argument
after --track-high-water-mark: that's the minimum size (in bytes) at
which it will report live sampled allocations.
Poireau can also be used with perf record for short-lived tasks;
that's particularly useful with high water mark heap profiling.
First record perf.data with
perf record -T -e sdt_libpoireau:* --call-graph=dwarf -- ./profilee ...
then pipe the data to analysis with
perf script | ./poireau.py ...
When LD_PRELOADed, libpoireau intercepts every call to
malloc/calloc/realloc/free, and quickly forwards the vast majority
of calls to the real implementation that would be used if libpoireau
were absent.
In the case of malloc and calloc, only allocations that are marked
for sampling are diverted; free is overridden iff it is called on an
allocation that was diverted. Finally, realloc is treated as a pair
of malloc and free for sampling purposes.
The sampling logic simulates a process that samples each allocated
byte with equal probability. The default sampling rate aims for an
average of one sampled allocation every 32 MB; for example, an
allocation request for 100 bytes becomes part of the sample with the
same probability as if we had flipped a biased coin once per byte,
where the coin lands on "heads" with probability
1 / (32 * 1024 * 1024), and had made the request part of the sample
if any of these 100 coin flips landed on "heads."
This memoryless sampling strategy makes it possible to derive
statistical bounds on the shape of heap allocation calls, even with
an adversarial workload. However, a naive implementation is slow.
Rather than flipping biased coins for each allocated byte, we
instead generate the number of consecutive "tails" results by
drawing values from an Exponential distribution.
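The sketch below shows one way to implement this exponential-gap
sampling; the names and structure are illustrative rather than
libpoireau's actual implementation, and drand48 stands in for the
pseudo-random generator the library actually bundles (xoshiro 256+,
see the licence note at the end).

    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* One sampled allocation every 32 MB of requested bytes, on average. */
    #define SAMPLE_PERIOD_BYTES (32.0 * 1024 * 1024)

    static double bytes_until_next_sample = -1; /* negative: not drawn yet */

    /* Draw the gap until the next sampled byte from an Exponential
     * distribution with mean SAMPLE_PERIOD_BYTES (inverse transform
     * sampling). */
    static double next_gap(void)
    {
            double u = 1.0 - drand48(); /* uniform in (0, 1], avoids log(0) */

            return -SAMPLE_PERIOD_BYTES * log(u);
    }

    /* Returns true iff this allocation request should be sampled, i.e.,
     * iff at least one of its bytes would have come up "heads" in the
     * per-byte coin-flipping description above. */
    static bool should_sample(size_t request)
    {
            if (bytes_until_next_sample < 0)
                    bytes_until_next_sample = next_gap();

            bytes_until_next_sample -= (double)request;
            if (bytes_until_next_sample >= 0)
                    return false;

            bytes_until_next_sample = next_gap();
            return true;
    }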
Whenever a call to malloc, calloc, or realloc is picked for
sampling, libpoireau executes code that is instrumented with USDT
(user statically-defined tracing) probes. Linux perf can annotate
that code to generate events (this is a system-wide switch, for
every process that linked the shared library); we use these events
to let the kernel capture callstacks for each sampled call.
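The sketch below shows what such a USDT probe point looks like,
using the DTRACE_PROBE macros from the sys/sdt.h header that
libpoireau bundles; the wrapper, probe name, and arguments are
assumptions for illustration, not libpoireau's actual probe layout.

    #include <stdlib.h>
    #include <sys/sdt.h>

    /* Hypothetical wrapper for a sampled allocation. */
    void *probed_malloc(size_t size)
    {
            void *ptr = malloc(size); /* stand-in for the tracking allocator */

            /* Once the shared object is added to perf's buildid cache and
             * the probe is enabled with perf probe, this macro shows up
             * under the sdt_libpoireau group of tracepoints; every hit
             * lets the kernel record an event and capture a callstack. */
            DTRACE_PROBE2(libpoireau, sampled_malloc, size, ptr);
            return ptr;
    }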
In addition, these allocation requests are diverted to an internal
tracking allocator. This lets us identify calls to free and realloc
on tracked allocations, which is crucial to generate paired USDT
events ("this allocation was freed or reallocated"); it also ensures
we release these allocations back to the tracking allocator, rather
than to the system malloc.
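Putting the pieces together, a minimal sketch of the interception
strategy could look like the following; should_sample is the helper
from the sampling sketch above, tracking_malloc stands for a
hypothetical tracking allocator, and none of this is libpoireau's
actual implementation.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdbool.h>
    #include <stddef.h>

    bool should_sample(size_t request);    /* see the sampling sketch above */
    void *tracking_malloc(size_t request); /* hypothetical tracking allocator */

    void *malloc(size_t size)
    {
            static void *(*real_malloc)(size_t);

            /* Find the malloc we are shadowing: the system allocator, or
             * whatever allocator is next in the preload chain. */
            if (real_malloc == NULL)
                    real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

            /* Fast path: the vast majority of calls are forwarded as-is. */
            if (!should_sample(size))
                    return real_malloc(size);

            /* Sampled requests are diverted to the tracking allocator,
             * which also fires the USDT probes; free and realloc only
             * need special handling for pointers that came from here. */
            return tracking_malloc(size);
    }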
Performance-sensitive programs tend to avoid dynamic memory
allocation in hot spots. That being said, here are a couple of
microbenchmarks to try and upper bound the overhead of LD_PRELOADing
libpoireau.so, by repeatedly making pairs of calls to malloc and
free (a best case for most memory allocators) in a single thread.
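The benchmark loop is shaped roughly like the sketch below; the
iteration count, allocation size, and timing code are illustrative,
not the exact harness behind the numbers that follow.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
            enum { ITERATIONS = 10 * 1000 * 1000 };
            const size_t size = 1024 * 1024; /* 1 MB: the large-allocation case */
            struct timespec begin, end;

            clock_gettime(CLOCK_MONOTONIC, &begin);
            for (long i = 0; i < ITERATIONS; i++) {
                    void *ptr = malloc(size);

                    if (ptr == NULL)
                            abort();
                    /* Touch the allocation so the pair isn't optimised away. */
                    *(volatile char *)ptr = 0;
                    free(ptr);
            }

            clock_gettime(CLOCK_MONOTONIC, &end);

            double elapsed = (end.tv_sec - begin.tv_sec) +
                1e-9 * (end.tv_nsec - begin.tv_nsec);

            printf("%.3f us/malloc-free\n", 1e6 * elapsed / ITERATIONS);
            return 0;
    }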
The results below were timed on an unloaded AMD EPYC 7601 running
Linux 5.3.11 and glibc 2.27.
Large allocations (1 MB), with a sample period of 32 MB (p = 3.2%):
baseline (glibc malloc): 0.092 us/malloc-free (0.047 user, 0.046 system)
preloaded, no probe: 0.153 us/malloc-free (0.058 user, 0.094 system)
preloaded, with probes: 0.236 us/malloc-free (0.067 user, 0.169 system)
preloaded, with tracing: 0.271 us/malloc-free (0.069 user, 0.203 system)
This is pretty much our worst case: we expect to trigger allocation
tracking very frequently, once every 32 allocations, and our
tracking allocator is slightly more complex than a plain
mmap/munmap (something we should still improve).
Mid-sized allocations (16 KB), with a sample period of 32 MB (p = 0.049%):
baseline (glibc malloc): 0.042 us/malloc-free (0.041 user, 0.001 system)
preloaded, no probe: 0.044 us/malloc-free (0.043 user, 0.001 system)
preloaded, with probes: 0.046 us/malloc-free (0.042 user, 0.004 system)
preloaded, with tracing: 0.054 us/malloc-free (0.042 user, 0.012 system)
At this less unreasonable size, the overhead of diverting sampled
allocations to a tracking allocator is less than 5%. We can also
observe that, while triggering an interrupt whenever we execute a
tracepoint isn't free, the time spent servicing the interrupt is
relatively small (< 20%) compared to the time it takes to generate a
backtrace. This isn't surprising, since we use the same part of the
kernel that's exercised when analysing performance issues with perf.
Small-sized allocations (128 B), with a sample period of 32 MB (p = 0.00038%):
baseline (glibc malloc): 0.017 us/malloc-free (0.017 user, 0.000 system)
preloaded, no probe: 0.020 us/malloc-free (0.020 user, 0.000 system)
preloaded, with probes: 0.020 us/malloc-free (0.020 user, 0.000 system)
preloaded, with tracing: 0.020 us/malloc-free (0.020 user, 0.000 system)
Here, all the slowdown is introduced by trampolining from our interceptor malloc to the base system malloc.
TL;DR: in allocation microbenchmarks, the overhead of libpoireau instrumentation is on the order of 5-20% for small or medium allocations, and goes up to ~70% for very large allocations.
Enabling allocation tracing adds another 0-20% for small or medium allocations, and ~130% for very large allocations.
These are worst-case figures, for a program that does nothing but
repeatedly malloc and free in a loop. In practice, a performance
sensitive program hopefully spends less than 10% of its time in memory
management (and much less than that in large allocations), which means
the total overhead introduced by libpoireau and capturing stack traces
is probably closer to 1-5%.
libpoireau includes code derived from xoshiro 256+ 1.0, written in 2018 by David Blackman and Sebastiano Vigna (vigna@acm.org) and dedicated to the public domain.
libpoireau includes Systemtap's sys/sdt.h, a file dedicated to the
public domain.