Runtime differences single- vs multi-threaded host applications
Closed this issue · 3 comments
This is again a micro-benchmark, but I think an interesting observation. Relates to zeek/zeek#3379.
When running spicy-driver for micro-benchmarking, glibc runs in single-threaded mode and avoids atomic instructions for std::shared_ptr reference counting. Within multi-threaded applications like Zeek, std::shared_ptr usage is more expensive. After patching spicy-driver.cc to start a very short-lived thread, which switches glibc into multi-threaded mode, the attached micro-benchmark runs ~6% slower on my system, apparently due to nothing more than the use of atomic instructions for std::shared_ptr.
Recording the spicy-driver run with perf record --call-graph dwarf, the __gnu_cxx::__exchange_and_add function is reported as the hottest function, with ~5.8% of samples. In a zeek -r test with the QUIC analyzer, it shows up with ~3% of samples.
I'm not sure there is anything that can be fixed here unless removing std::shared_ptr is on the table, so I'm opening this mostly as an FYI.
(Testing was done with code from #1590)
$ hyperfine -w 1 'cat test-data.txt | spicy-driver nested.hlto' 'cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto'
Benchmark 1: cat test-data.txt | spicy-driver nested.hlto
Time (mean ± σ): 11.364 s ± 0.067 s [User: 11.343 s, System: 0.038 s]
Range (min … max): 11.255 s … 11.463 s 10 runs
Benchmark 2: cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto
Time (mean ± σ): 12.011 s ± 0.140 s [User: 11.982 s, System: 0.036 s]
Range (min … max): 11.885 s … 12.357 s 10 runs
Summary
'cat test-data.txt | spicy-driver nested.hlto' ran
1.06 ± 0.01 times faster than 'cat test-data.txt | SPICY_THREAD=1 spicy-driver nested.hlto'
module Test;

type J = unit {
    x: bytes &size=1;
};

type K = unit {
    j: J;
};

type L = unit {
    k1: K;
    k2: K;
    k3: K;
    k4: K;
};

type M = unit {
    l1: L;
    l2: L;
    l3: L;
    l4: L;
};

public type N = unit {
    : M[] &eod;
};
Patch to spicy-driver.cc:
--- a/spicy/toolchain/bin/spicy-driver.cc
+++ b/spicy/toolchain/bin/spicy-driver.cc
@@ -2,8 +2,12 @@
 #include <getopt.h>

+#include <cstdio>
+#include <cstdlib>
 #include <fstream>
 #include <iostream>
+#include <sys/single_threaded.h> // declares __libc_single_threaded (glibc >= 2.32)
+#include <thread>

 #include <hilti/rt/libhilti.h>
@@ -289,6 +293,18 @@ int main(int argc, char** argv) {
     if ( ! parser )
         driver.fatalError(parser.error());

+    std::fprintf(stderr, "single threaded %x\n", __libc_single_threaded);
+
+    if ( std::getenv("SPICY_THREAD") != nullptr ) {
+        std::fprintf(stderr, "starting thread\n");
+        std::thread t([]() {
+            std::fprintf(stderr, "running in thread\n");
+        });
+        t.join();
+    }
+
+    std::fprintf(stderr, "single threaded %x\n", __libc_single_threaded);
+
     if ( auto x = driver.processInput(**parser, in, driver.opt_increment); ! x )
         driver.fatalError(x.error());
 }
Thanks for this issue @awelzel, this is a useful micro-benchmarking result. We use std::shared_ptr extensively in the runtime library, e.g., for the control blocks of views and of our safe iterators, which we copy heavily; we definitely want to avoid unneeded overhead there.
I ran your reproducer with Clang and libc++ and observed no significant overhead there, but we do not want this overhead for GCC and libstdc++ either.
Unassigning myself since at the moment there is no clear path forward that wouldn't involve a lot of work, such as implementing a standard-library-quality smart pointer.
I don't see us changing the smart pointer any time soon, so I'm closing this for lack of a way forward.