Significant overheads observed for `waitForInput`
Closed this issue · 3 comments
We experimented with 4 Spicy variants that each read 8 bytes in a unit and observed that `waitForInput` (and/or `waitForInputOrEod`) seems to add non-trivial runtime overhead. Below we first describe the 4 cases and the runtimes observed with a benchmarking pcap, even with reduced arguments to `std::string` (see #1589 by @awelzel).
a) Case 1:

```spicy
type PDU = unit {
    : uint64;
};
```

b) Case 2:

```spicy
type PDU = unit {
    : uint8[] &count=8;
};
```

c) Case 3:

```spicy
type PDU = unit {
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
};
```

d) Case 4:

```spicy
type PDU = unit {
    : bytes &size=8;
};
```
With an end-to-end test (with Zeek version 6.2.0-dev.137 and Spicy version spicyc v1.9.0 (8cb591d1)), we observed the following runtimes with `hyperfine` for 5 runs, without any edits to the generated C++ code:
```
Case-1: Time (mean ± σ):  6.108 s ± 0.069 s
Case-2: Time (mean ± σ): 16.374 s ± 0.082 s
Case-3: Time (mean ± σ): 15.722 s ± 0.109 s
Case-4: Time (mean ± σ):  8.229 s ± 0.133 s
```
We then compiled the generated C++ code after reducing the arguments to `std::string` to just one character, and observed the following runtimes:
```
Case-1: Time (mean ± σ):  9.650 s ± 0.236 s
Case-2: Time (mean ± σ): 22.896 s ± 0.078 s
Case-3: Time (mean ± σ): 21.650 s ± 0.188 s
Case-4: Time (mean ± σ): 13.864 s ± 0.340 s
```
We are not sure why the runtimes for the manually compiled hlto are higher than for the spicyz-generated hlto, but the trend in execution runtime remains the same.
I am unable to reproduce this. Can you describe how you arrived at the conclusion that these differences are due to `waitForInput`, @Mohan-Dhawan?
I used the following modified grammar:

```spicy
module foo;

type PDU1 = unit {
    : uint64;
};

type PDU2 = unit {
    : uint8[] &count=8;
};

type PDU3 = unit {
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
};

type PDU4 = unit {
    : bytes &size=8;
};

public type P1 = unit { : PDU1[]; };
public type P2 = unit { : PDU2[]; };
public type P3 = unit { : PDU3[]; };
public type P4 = unit { : PDU4[]; };
```
I then compiled this with a RelWithDebInfo build of Spicy like so:

```
spicyc -j foo.spicy -o foo.hlto
```
As input I used about 1MB of random data:

```
$ dd if=/dev/random of=input bs=1024 count=1024
1024+0 records in
1024+0 records out
1048576 bytes transferred in 0.007385 secs (141987271 bytes/sec)
```
With that I see the following runtimes which definitely differ:
```
$ hyperfine -w 3 -P I 1 4 './bin/spicy-driver -f input foo.hlto -p foo::P{I}'
Benchmark 1: ./bin/spicy-driver -f input foo.hlto -p foo::P1
  Time (mean ± σ):      23.5 ms ±  0.7 ms    [User: 20.9 ms, System: 2.1 ms]
  Range (min … max):    22.2 ms … 25.3 ms    119 runs

Benchmark 2: ./bin/spicy-driver -f input foo.hlto -p foo::P2
  Time (mean ± σ):      72.0 ms ±  2.7 ms    [User: 68.4 ms, System: 2.8 ms]
  Range (min … max):    68.5 ms … 77.6 ms    41 runs

Benchmark 3: ./bin/spicy-driver -f input foo.hlto -p foo::P3
  Time (mean ± σ):      64.7 ms ±  1.7 ms    [User: 61.3 ms, System: 2.7 ms]
  Range (min … max):    62.6 ms … 68.6 ms    42 runs

Benchmark 4: ./bin/spicy-driver -f input foo.hlto -p foo::P4
  Time (mean ± σ):      36.3 ms ±  0.6 ms    [User: 33.6 ms, System: 2.2 ms]
  Range (min … max):    35.0 ms … 37.9 ms    77 runs

Summary
  ./bin/spicy-driver -f input foo.hlto -p foo::P1 ran
    1.55 ± 0.05 times faster than ./bin/spicy-driver -f input foo.hlto -p foo::P4
    2.76 ± 0.11 times faster than ./bin/spicy-driver -f input foo.hlto -p foo::P3
    3.07 ± 0.15 times faster than ./bin/spicy-driver -f input foo.hlto -p foo::P2
```
Looking at, e.g., the differences between `P2` and `P1`, the main difference I see is that different types are parsed (e.g., `PDU1`'s `uint64` is allocated on the stack while `PDU2`'s dynamic list `uint8[]` needs a heap allocation). AFAICT all differences in `waitForInput` usage are down to `PDU1` extracting a single field while `PDU2` extracts 8; overall they appear pretty close and not that prominent in the perf trace though.
@bbannier it seems you did reproduce the overheads for the 4 cases. The only reason to suspect `waitForInput` is that nothing else in the generated C++ code seemed to jump out and attract attention. We had already included @awelzel's workaround for the `std::string` constructor. Is there a reliable way to determine the overhead of stack switching to really confirm that `waitForInput` is not the bottleneck?
> it seems you did reproduce the overheads for the 4 cases.
I see, but I wouldn't go so far as to call these "overheads", since the different parsers do different things; this is an apples-to-oranges comparison. My suggestion would be to use a parser which extracts only the data you require (and not more) so you don't pay for something you do not need.
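For example, when part of the input is never inspected, Spicy's `skip` fields consume it without materializing a value. A sketch (the field names and sizes here are invented for illustration):

```spicy
module foo;

public type PDU = unit {
    # An 8-byte header we never look at: `skip` consumes the input
    # without storing it, so no value is constructed for it.
    : skip bytes &size=8;

    # Only this field is actually extracted into the unit.
    payload: bytes &size=8;
};
```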
> The only reason to suspect `waitForInput` is because nothing else in the generated C++ code seemed to jump out and attract attention.
Yepp, the work is spread pretty evenly among the things which need to happen here to parse the given grammars, and any differences in `waitForInput` are due to different things being parsed (e.g., you would see more relative time spent in it if you are parsing eight fields as opposed to one field). This is a fundamental limitation of microbenchmarks like the ones here and IMO makes them less useful for clearly identifying bottlenecks in real grammars (sometimes they are interesting starting points though).
> Is there a reliable way to determine the overhead of stack switching to really confirm that `waitForInput` is not the bottleneck?
The most expensive part of stack switching (at least with default fiber behavior) is in saving and restoring the stacks. This happens in `StackBuffer::save` and `StackBuffer::restore`. For the large private grammar you are looking at this makes up about 10% of the parsing work.
I am closing this issue for now, since I don't think anything can be done about the differences you reported.