ParsingState::_process: Overhead due to unconditional profiling calls - fmt particularly?
Closed this issue · 4 comments
I've generated an artificial ~850MB pcap with 16x50 MB downloads via QUIC and separately a ~100MB pcap with ~12k small QUIC connections for perf testing of the Spicy QUIC analyzer.
Drilling into perf output / flamegraphs, there was visible overhead in ParsingState::_process
invoking tinyformat and allocation of output streams and strings and std::local. There might be some overhead due to the profile calls itself, but that isn't explicitly visible probably due to inlining.
After naively commenting out the profiler::start
/ profiler::stop
calls in spicy/runtime/src/driver.cc
, runtime of a zeek -r
execution for the 850MB pcap improved by ~20% (10.5s to 8.5s) and for the small connections pcap by a bit more than 3% (15.0s to 14.5s).
For the QUIC parser, after the first few packets, the payload packets are mostly just consumed via skip &eod
unless something interesting comes along later on. As there's not much else to do, otherwise small overhead shows up pretty significant.
Even a 3.3% overhead appears large for unused functionality. Maybe the formatted strings could be cached on the parser instance if the profiling otherwise has negligible perf impact. Otherwise, compilation flags to opt-out/opt-in for for production/release might be another consideration.
Testing a dns-only pcap with the spicy-dns analyzer and the same modification improves runtime by ~2.6% (22.6 to 22.0). Not overly much, but DNS's event to packet ratio is rather high.
Hey Robin, using the string concatenation improves things quite a bit, but not doing it seems still faster.
With latest master and the quic-16-50mb-transfers.pcap
pcap with -C -b base/protocols/quic
takes 8.7 seconds, your branch takes 7.53 seconds instead.
Using the patch below (not pretty) shapes off another four hundred milliseconds from 7.53 seconds down to 7.11 seconds, so I'd vote for something that doesn't do std::string + std::string if not absolutely necessary ;-)
Here's a conn.log entry for one of the 16 connections:
#fields ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig local_resp missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents
1696593130.800633 CD6gm53SnyieQzKDXc 172.31.0.1 46620 172.31.0.2 443 udp ssl,quic 6.871709 188601 51467433 SF T T 0 Dd 5299 336973 40554 52602945 -
Basically some 350k ~340k packets per connection - that just stresses the _process()
method I'm sure. The pcap is on the os-perf-1 system in case you want to pull it.
diff --git a/spicy/runtime/src/driver.cc b/spicy/runtime/src/driver.cc
index 098d550d..2264f831 100644
--- a/spicy/runtime/src/driver.cc
+++ b/spicy/runtime/src/driver.cc
@@ -175,7 +175,9 @@ Result<spicy::rt::ParsedUnit> Driver::processInput(const spicy::rt::Parser& pars
in.read(buffer, static_cast<std::streamsize>(len));
{
- auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/input/") + parser.name);
+ auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
+ hilti::rt::profiler::start(std::string("spicy/prepare/input/") + parser.name) :
+ std::optional<hilti::rt::Profiler>();
if ( auto n = in.gcount() )
data->append(hilti::rt::Bytes(buffer, n));
@@ -251,7 +253,9 @@ driver::ParsingState::State driver::ParsingState::_process(size_t size, const ch
switch ( _type ) {
case ParsingType::Block: {
DRIVER_DEBUG("block", size, data);
- auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/block/") + _parser->name);
+ auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
+ hilti::rt::profiler::start(std::string("spicy/prepare/block/") + _parser->name) :
+ std::optional<hilti::rt::Profiler>();
auto input = hilti::rt::reference::make_value<hilti::rt::Stream>(data, size);
input->freeze();
@@ -288,7 +292,9 @@ driver::ParsingState::State driver::ParsingState::_process(size_t size, const ch
return Done;
}
- auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/stream/") + _parser->name);
+ auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
+ hilti::rt::profiler::start(std::string("spicy/prepare/stream/") + _parser->name) :
+ std::optional<hilti::rt::Profiler>();
if ( ! _input ) {
// First chunk.
Using the patch below (not pretty) shapes off another four hundred milliseconds from 7.53 seconds down to 7.11 seconds, so I'd vote for something that doesn't do std::string + std::string if not absolutely necessary ;-)
Yeah, I hear you. I was hoping the compiler would be able to optimize this better. Can you do one more measurement for me: what's the runtime if you do hilti::rt::profiler::start
with just an (arbitrary) static string, no concatenation. I'm wondering how much overhead is there just in the call out to start
(which your patch prevents as well).
Hey Robin - I think the following is easier and more focused than zeek -r
. Micro-benchmarking the various versions. Are you able to compile/modify that on your end?
Here, profiler::start
with an existing std::string or const char* is faster than the pre-check. The concat
approach is more than 100 times slower than using just parser.name
for profiler::start()
input. Pre-computing the strings would help. The pre-check turns out to be slower with GCC compared to parser.name
, with clang-15 the pre-check is faster. But magnitudes faster than "tinyformat" or "concat".
Hope that helps.
$ ./profiler_benchmark
2023-10-20T10:51:22+02:00
Running ./profiler_benchmark
Run on (8 X 1780.82 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 0.35, 0.34, 0.51
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
BM_tinyformat 717 ns 717 ns 970839
BM_concat 80.9 ns 80.9 ns 8624761
BM_std_string_immediate 29.9 ns 29.9 ns 23481370
BM_std_string_parser_name 0.558 ns 0.558 ns 1000000000
BM_static_const_char 0.558 ns 0.558 ns 1000000000
BM_pre_check_off 1.12 ns 1.12 ns 615904000
BM_pre_check_on 382 ns 382 ns 1833823
#include <string>
#include <benchmark/benchmark.h>
#include <hilti/rt/fmt.h>
#include <hilti/rt/profiler.h>
class Parser {
public:
Parser(std::string name) : name(name) {}
std::string name;
};
static void BM_tinyformat(benchmark::State& state) {
hilti::rt::init();
hilti::rt::detail::globalState()->profiling_enabled = false;
auto parser = Parser("x");
for ( auto _ : state ) {
auto profiler = hilti::rt::profiler::start(hilti::rt::fmt("spicy/prepare/input/%s", parser.name));
hilti::rt::profiler::stop(profiler);
}
}
static void BM_concat(benchmark::State& state) {
hilti::rt::init();
hilti::rt::detail::globalState()->profiling_enabled = false;
auto parser = Parser("x");
for ( auto _ : state ) {
auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/input") + parser.name);
hilti::rt::profiler::stop(profiler);
}
}
static void BM_std_string_immediate(benchmark::State& state) {
hilti::rt::init();
hilti::rt::detail::globalState()->profiling_enabled = false;
auto parser = Parser("x");
for ( auto _ : state ) {
auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/input"));
hilti::rt::profiler::stop(profiler);
}
}
static void BM_std_string_parser_name(benchmark::State& state) {
hilti::rt::init();
hilti::rt::detail::globalState()->profiling_enabled = false;
auto parser = Parser("x");
for ( auto _ : state ) {
auto profiler = hilti::rt::profiler::start(parser.name);
hilti::rt::profiler::stop(profiler);
}
}
static void BM_static_const_char(benchmark::State& state) {
hilti::rt::init();
hilti::rt::detail::globalState()->profiling_enabled = false;
auto parser = Parser("x");
for ( auto _ : state ) {
auto profiler = hilti::rt::profiler::start("spicy/prepare/input");
hilti::rt::profiler::stop(profiler);
}
}
static void BM_pre_check_off(benchmark::State& state) {
hilti::rt::init();
hilti::rt::detail::globalState()->profiling_enabled = false;
auto parser = Parser("x");
for ( auto _ : state ) {
auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
hilti::rt::profiler::start(std::string("spicy/prepare/stream/") + parser.name) :
std::optional<hilti::rt::Profiler>();
hilti::rt::profiler::stop(profiler);
}
}
static void BM_pre_check_on(benchmark::State& state) {
hilti::rt::init();
hilti::rt::detail::globalState()->profiling_enabled = true;
auto parser = Parser("x");
for ( auto _ : state ) {
auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
hilti::rt::profiler::start(std::string("spicy/prepare/stream/") + parser.name) :
std::optional<hilti::rt::Profiler>();
hilti::rt::profiler::stop(profiler);
}
}
BENCHMARK(BM_tinyformat);
BENCHMARK(BM_concat);
BENCHMARK(BM_std_string_immediate);
BENCHMARK(BM_std_string_parser_name);
BENCHMARK(BM_static_const_char);
BENCHMARK(BM_pre_check_off);
BENCHMARK(BM_pre_check_on);
BENCHMARK_MAIN();
Crude Makefile...
ZEEKBASE:=/opt/zeek-dev-prod/
profiler_benchmark: profiler_benchmark.cc
$(CXX) $< -o $@ -O3 -std=c++17 $(CXXFLAGS) -L $(ZEEKBASE)/lib -I $(ZEEKBASE)/include $(ZEEKBASE)/lib/libhilti-rt.a -lhilti -lbenchmark