zeek/spicy

ParsingState::_process: Overhead due to unconditional profiling calls - fmt particularly?

Closed this issue · 4 comments

I've generated an artificial ~850MB pcap with 16x50 MB downloads via QUIC and separately a ~100MB pcap with ~12k small QUIC connections for perf testing of the Spicy QUIC analyzer.

Drilling into perf output / flamegraphs, there was visible overhead in ParsingState::_process invoking tinyformat and allocation of output streams and strings and std::local. There might be some overhead due to the profile calls itself, but that isn't explicitly visible probably due to inlining.

After naively commenting out the profiler::start / profiler::stop calls in spicy/runtime/src/driver.cc, runtime of a zeek -r execution for the 850MB pcap improved by ~20% (10.5s to 8.5s) and for the small connections pcap by a bit more than 3% (15.0s to 14.5s).

For the QUIC parser, after the first few packets, the payload packets are mostly just consumed via skip &eod unless something interesting comes along later on. As there's not much else to do, otherwise small overhead shows up pretty significant.

Even a 3.3% overhead appears large for unused functionality. Maybe the formatted strings could be cached on the parser instance if the profiling otherwise has negligible perf impact. Otherwise, compilation flags to opt-out/opt-in for for production/release might be another consideration.


Testing a dns-only pcap with the spicy-dns analyzer and the same modification improves runtime by ~2.6% (22.6 to 22.0). Not overly much, but DNS's event to packet ratio is rather high.

rsmmr commented

@awelzel can you see if #1576 is good enough to address this?

Hey Robin, using the string concatenation improves things quite a bit, but not doing it seems still faster.

With latest master and the quic-16-50mb-transfers.pcap pcap with -C -b base/protocols/quic takes 8.7 seconds, your branch takes 7.53 seconds instead.

Using the patch below (not pretty) shapes off another four hundred milliseconds from 7.53 seconds down to 7.11 seconds, so I'd vote for something that doesn't do std::string + std::string if not absolutely necessary ;-)

Here's a conn.log entry for one of the 16 connections:

#fields ts      uid     id.orig_h       id.orig_p       id.resp_h       id.resp_p       proto   service duration        orig_bytes      resp_bytes      conn_state      local_orig      local_resp      missed_bytes    history orig_pkts       orig_ip_bytes   resp_pkts       resp_ip_bytes       tunnel_parents
1696593130.800633       CD6gm53SnyieQzKDXc      172.31.0.1      46620   172.31.0.2      443     udp     ssl,quic        6.871709        188601  51467433        SF      T       T       0       Dd      5299    336973  40554   52602945        -

Basically some 350k ~340k packets per connection - that just stresses the _process() method I'm sure. The pcap is on the os-perf-1 system in case you want to pull it.

diff --git a/spicy/runtime/src/driver.cc b/spicy/runtime/src/driver.cc
index 098d550d..2264f831 100644
--- a/spicy/runtime/src/driver.cc
+++ b/spicy/runtime/src/driver.cc
@@ -175,7 +175,9 @@ Result<spicy::rt::ParsedUnit> Driver::processInput(const spicy::rt::Parser& pars
         in.read(buffer, static_cast<std::streamsize>(len));
 
         {
-            auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/input/") + parser.name);
+            auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
+                                hilti::rt::profiler::start(std::string("spicy/prepare/input/") + parser.name) :
+                                std::optional<hilti::rt::Profiler>();
 
             if ( auto n = in.gcount() )
                 data->append(hilti::rt::Bytes(buffer, n));
@@ -251,7 +253,9 @@ driver::ParsingState::State driver::ParsingState::_process(size_t size, const ch
         switch ( _type ) {
             case ParsingType::Block: {
                 DRIVER_DEBUG("block", size, data);
-                auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/block/") + _parser->name);
+                auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
+                                    hilti::rt::profiler::start(std::string("spicy/prepare/block/") + _parser->name) :
+                                    std::optional<hilti::rt::Profiler>();
 
                 auto input = hilti::rt::reference::make_value<hilti::rt::Stream>(data, size);
                 input->freeze();
@@ -288,7 +292,9 @@ driver::ParsingState::State driver::ParsingState::_process(size_t size, const ch
                     return Done;
                 }
 
-                auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/stream/") + _parser->name);
+                auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
+                                    hilti::rt::profiler::start(std::string("spicy/prepare/stream/") + _parser->name) :
+                                    std::optional<hilti::rt::Profiler>();
 
                 if ( ! _input ) {
                     // First chunk.
rsmmr commented

Using the patch below (not pretty) shapes off another four hundred milliseconds from 7.53 seconds down to 7.11 seconds, so I'd vote for something that doesn't do std::string + std::string if not absolutely necessary ;-)

Yeah, I hear you. I was hoping the compiler would be able to optimize this better. Can you do one more measurement for me: what's the runtime if you do hilti::rt::profiler::start with just an (arbitrary) static string, no concatenation. I'm wondering how much overhead is there just in the call out to start (which your patch prevents as well).

Hey Robin - I think the following is easier and more focused than zeek -r. Micro-benchmarking the various versions. Are you able to compile/modify that on your end?

Here, profiler::start with an existing std::string or const char* is faster than the pre-check. The concat approach is more than 100 times slower than using just parser.name for profiler::start() input. Pre-computing the strings would help. The pre-check turns out to be slower with GCC compared to parser.name, with clang-15 the pre-check is faster. But magnitudes faster than "tinyformat" or "concat".

Hope that helps.

$ ./profiler_benchmark 
2023-10-20T10:51:22+02:00
Running ./profiler_benchmark
Run on (8 X 1780.82 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.35, 0.34, 0.51
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_tinyformat                    717 ns          717 ns       970839
BM_concat                       80.9 ns         80.9 ns      8624761
BM_std_string_immediate         29.9 ns         29.9 ns     23481370
BM_std_string_parser_name      0.558 ns        0.558 ns   1000000000
BM_static_const_char           0.558 ns        0.558 ns   1000000000
BM_pre_check_off                1.12 ns         1.12 ns    615904000
BM_pre_check_on                  382 ns          382 ns      1833823
#include <string>

#include <benchmark/benchmark.h>
#include <hilti/rt/fmt.h>
#include <hilti/rt/profiler.h>

class Parser {
public:
    Parser(std::string name) : name(name) {}
    std::string name;
};

static void BM_tinyformat(benchmark::State& state) {
    hilti::rt::init();
    hilti::rt::detail::globalState()->profiling_enabled = false;
    auto parser = Parser("x");
    for ( auto _ : state ) {
        auto profiler = hilti::rt::profiler::start(hilti::rt::fmt("spicy/prepare/input/%s", parser.name));
        hilti::rt::profiler::stop(profiler);
    }
}

static void BM_concat(benchmark::State& state) {
    hilti::rt::init();
    hilti::rt::detail::globalState()->profiling_enabled = false;
    auto parser = Parser("x");
    for ( auto _ : state ) {
        auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/input") + parser.name);
        hilti::rt::profiler::stop(profiler);
    }
}

static void BM_std_string_immediate(benchmark::State& state) {
    hilti::rt::init();
    hilti::rt::detail::globalState()->profiling_enabled = false;
    auto parser = Parser("x");
    for ( auto _ : state ) {
        auto profiler = hilti::rt::profiler::start(std::string("spicy/prepare/input"));
        hilti::rt::profiler::stop(profiler);
    }
}

static void BM_std_string_parser_name(benchmark::State& state) {
    hilti::rt::init();
    hilti::rt::detail::globalState()->profiling_enabled = false;
    auto parser = Parser("x");
    for ( auto _ : state ) {
        auto profiler = hilti::rt::profiler::start(parser.name);
        hilti::rt::profiler::stop(profiler);
    }
}

static void BM_static_const_char(benchmark::State& state) {
    hilti::rt::init();
    hilti::rt::detail::globalState()->profiling_enabled = false;
    auto parser = Parser("x");
    for ( auto _ : state ) {
        auto profiler = hilti::rt::profiler::start("spicy/prepare/input");
        hilti::rt::profiler::stop(profiler);
    }
}

static void BM_pre_check_off(benchmark::State& state) {
    hilti::rt::init();
    hilti::rt::detail::globalState()->profiling_enabled = false;
    auto parser = Parser("x");
    for ( auto _ : state ) {
        auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
                            hilti::rt::profiler::start(std::string("spicy/prepare/stream/") + parser.name) :
                            std::optional<hilti::rt::Profiler>();
        hilti::rt::profiler::stop(profiler);
    }
}

static void BM_pre_check_on(benchmark::State& state) {
    hilti::rt::init();
    hilti::rt::detail::globalState()->profiling_enabled = true;
    auto parser = Parser("x");
    for ( auto _ : state ) {
        auto profiler = ::hilti::rt::detail::unsafeGlobalState()->profiling_enabled ?
                            hilti::rt::profiler::start(std::string("spicy/prepare/stream/") + parser.name) :
                            std::optional<hilti::rt::Profiler>();
        hilti::rt::profiler::stop(profiler);
    }
}

BENCHMARK(BM_tinyformat);
BENCHMARK(BM_concat);
BENCHMARK(BM_std_string_immediate);
BENCHMARK(BM_std_string_parser_name);
BENCHMARK(BM_static_const_char);
BENCHMARK(BM_pre_check_off);
BENCHMARK(BM_pre_check_on);

BENCHMARK_MAIN();

Crude Makefile...

ZEEKBASE:=/opt/zeek-dev-prod/

profiler_benchmark: profiler_benchmark.cc
        $(CXX) $< -o $@ -O3 -std=c++17 $(CXXFLAGS) -L $(ZEEKBASE)/lib -I $(ZEEKBASE)/include $(ZEEKBASE)/lib/libhilti-rt.a -lhilti -lbenchmark