zeek/spicy

Significant std::string overhead for generated waitForInput(...,error, location,...) calls

Closed this issue · 3 comments

It's a micro-benchmark, but I believe there's something to it.

Compiling the reproducer t.spicy with its absolute filename (assuming an appropriate long parent directory) results in upwards of 10% slower parsing performance due to short-string-optimization of an immediate constructed std::string instance not being in effect. Not constructing any immediate std::string instances should help parsing performance even more. Particularly if they are not being used and not variable.

Reproducer at the bottom, executed within the latest zeek/zeek-dev container, turbo boost disabled.

Command Mean [s] Min [s] Max [s] Relative
cat test-data.txt | spicy-driver t.hlto 7.324 ± 0.063 7.254 7.376 1.00
cat test-data.txt | spicy-driver t-compiled-with-realpath.hlto 8.310 ± 0.074 8.246 8.392 1.13 ± 0.01

Background: Robin gave a demo showing generated C++ code of Spicy units and what tickled me were the two immediate std::string instances constructed for every call to waitForInput() for uint16 parsing:

:spicy::rt::detail::waitForInput(__data, __cur, ::hilti::rt::integer::safe<std::uint64_t>{2U}, std::string("expecting 2 bytes for unpacking value"), std::string("t.spicy:4:6"), ::hilti::rt::StrongReference<::spicy::rt::filter::detail::Filters>());

Compiling with an absolute path makes the second string exceed 15 bytes and therefore thwarts short-string optimization, causing an extra heap allocation for every waitForInput() call. The "expecting 2 bytes for unpacking value" also does an extra heap allocation that should probably be avoided.


Reproducer:

#!/bin/bash
set -eux

cat <<EOF > t.spicy
module Test;

type Bar = unit {
  g: uint16;
  h: uint16;
  i: uint16;

  j: uint16;
  k: uint16;
  l: uint16;

  m: uint16;
  n: uint16;
  o: uint16;

  p: uint16;
  q: uint16;
  r: uint16;
};

public type Foo = unit {
  : Bar[] &eod;
};
EOF

export HILTI_CXX_COMPILER_LAUNCHER=ccache

spicyc -j t.spicy -o t.hlto
# Encodes the full-filename into generated code, likely 15+ bytes.
spicyc -j $(realpath t.spicy) -o t-compiled-with-realpath.hlto

python3 -c 'print("xxxxxx" * 10000000, end="")' > test-data.txt
ls -lha test-data.txt

hyperfine --export-markdown results.md -M3 'cat test-data.txt | spicy-driver t.hlto' 'cat test-data.txt | spicy-driver t-compiled-with-realpath.hlto'

Great deep-dive! While the C++ code we call reasonably takes both error message and location by const ref,

extern void waitForInput(hilti::rt::ValueReference<hilti::rt::Stream>& data, // NOLINT(google-runtime-references)
const hilti::rt::stream::View& cur, uint64_t min, const std::string& error_msg,
const std::string& location,
hilti::rt::StrongReference<spicy::rt::filter::detail::Filters> filters);

when generating the code you pointed to above we instantiate values for them,

void ParserBuilder::waitForInput(const Expression& min, const std::string& error_msg, const Meta& location) {
builder()->addCall("spicy_rt::waitForInput", {state().data, state().cur, min, builder::string(error_msg),
builder::expression(location), _filters(state())});
}

This triggers allocations for information not crucial most of the time on the hot path (both builder::string and builder::expression here create temporary strings). HILTI currently does not have a non-owning string type, but if it did we could instead e.g., emit static tables owning locations/error messages which we could then reference cheaply (the values in these tables could even be constexpr const char[]s).

rsmmr commented

Yeah, this is pretty interesting, and I suspect we have more instances of this effect elsewhere. The code generator isn't very smart in this regard so far.

Another way to address this could be knowing when we could emit plain C-strings along with a string_view argument receiving things.

HILTI currently does not have a non-owning string type, but if it did we could instead e.g., emit static tables owning locations/error messages which we could then reference cheaply (the values in these tables could even be constexpr const char[]s).

I had wondered if a naive idea could be the following. The static within a function might not be totally for free, but probably cheaper than malloc/std::string construction.

static std::string location_4_6 = "t.spicy:4:6";
waitForInput(..., location_4_6, ...);