zeek/spicy

Significant overheads observed for `waitForInput`


We experimented with 4 Spicy variants that each read 8 bytes in a unit and observed that waitForInput (and/or waitForInputOrEod) seems to add non-trivial runtime overhead. Below we first describe the 4 cases and then the runtimes observed with a benchmarking pcap; the overhead shows up even with reduced arguments to std::string (see #1589 by @awelzel).

a) Case 1:

type PDU = unit {
	: uint64;
};

b) Case 2:

type PDU = unit {
	: uint8[] &count=8;
};

c) Case 3:

type PDU = unit {
	: uint8;
	: uint8;
	: uint8;
	: uint8;
	: uint8;
	: uint8;
	: uint8;
	: uint8;
};

d) Case 4:

type PDU = unit {
	: bytes &size=8;
};

In an end-to-end test (Zeek version 6.2.0-dev.137, Spicy spicyc v1.9.0 (8cb591d1)), we observed the following runtimes with hyperfine over 5 runs, without any edits to the generated C++ code:

Case-1: Time (mean ± σ): 6.108 s ± 0.069 s
Case-2: Time (mean ± σ): 16.374 s ± 0.082 s
Case-3: Time (mean ± σ): 15.722 s ± 0.109 s
Case-4: Time (mean ± σ): 8.229 s ± 0.133 s

We then manually compiled the generated C++ code after reducing the arguments to std::string to just one character, and observed the following runtimes:

Case-1: Time (mean ± σ): 9.650 s ± 0.236 s
Case-2: Time (mean ± σ): 22.896 s ± 0.078 s
Case-3: Time (mean ± σ): 21.650 s ± 0.188 s
Case-4: Time (mean ± σ): 13.864 s ± 0.340 s

We are not sure why the runtimes for the manually compiled .hlto are higher than for the spicyz-generated .hlto, but the trend in execution runtime remains the same.

I am unable to reproduce this. Can you describe how you arrived at the conclusion that these differences are due to waitForInput, @Mohan-Dhawan?


I used the following modified grammar:

module foo;

type PDU1 = unit {
    : uint64;
};

type PDU2 = unit {
    : uint8[] &count=8;
};

type PDU3 = unit {
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
    : uint8;
};

type PDU4 = unit {
    : bytes &size=8;
};

public type P1 = unit { : PDU1[]; };
public type P2 = unit { : PDU2[]; };
public type P3 = unit { : PDU3[]; };
public type P4 = unit { : PDU4[]; };

I then compiled this with a RelWithDebInfo build of Spicy like so:

spicyc -j foo.spicy -o foo.hlto

As input I used about 1MB of random data:

$ dd if=/dev/random of=input bs=1024 count=1024
1024+0 records in
1024+0 records out
1048576 bytes transferred in 0.007385 secs (141987271 bytes/sec)

With that I see the following runtimes, which definitely differ:

$ hyperfine -w 3 -P I 1 4 './bin/spicy-driver -f input foo.hlto -p foo::P{I}'
Benchmark 1: ./bin/spicy-driver -f input foo.hlto -p foo::P1
  Time (mean ± σ):      23.5 ms ±   0.7 ms    [User: 20.9 ms, System: 2.1 ms]
  Range (min … max):    22.2 ms …  25.3 ms    119 runs

Benchmark 2: ./bin/spicy-driver -f input foo.hlto -p foo::P2
  Time (mean ± σ):      72.0 ms ±   2.7 ms    [User: 68.4 ms, System: 2.8 ms]
  Range (min … max):    68.5 ms …  77.6 ms    41 runs

Benchmark 3: ./bin/spicy-driver -f input foo.hlto -p foo::P3
  Time (mean ± σ):      64.7 ms ±   1.7 ms    [User: 61.3 ms, System: 2.7 ms]
  Range (min … max):    62.6 ms …  68.6 ms    42 runs

Benchmark 4: ./bin/spicy-driver -f input foo.hlto -p foo::P4
  Time (mean ± σ):      36.3 ms ±   0.6 ms    [User: 33.6 ms, System: 2.2 ms]
  Range (min … max):    35.0 ms …  37.9 ms    77 runs

Summary
  ./bin/spicy-driver -f input foo.hlto -p foo::P1 ran
    1.55 ± 0.05 times faster than ./bin/spicy-driver -f input foo.hlto -p foo::P4
    2.76 ± 0.11 times faster than ./bin/spicy-driver -f input foo.hlto -p foo::P3
    3.07 ± 0.15 times faster than ./bin/spicy-driver -f input foo.hlto -p foo::P2

Looking at, e.g., the differences between P2 and P1, the main difference I see is that different types are parsed (e.g., PDU1's uint64 is allocated on the stack while PDU2's dynamic list uint8[] needs a heap allocation). AFAICT, all differences in waitForInput usage come down to PDU1 extracting a single field while PDU2 extracts eight; overall they appear pretty close and not that prominent in the perf trace, though.

@bbannier it seems you did reproduce the overheads for the 4 cases. The only reason to suspect waitForInput is that nothing else in the generated C++ code seemed to jump out and attract attention. We had already included @awelzel's workaround for the std::string constructor. Is there a reliable way to determine the overhead of stack switching, to really confirm that waitForInput is not the bottleneck?

it seems you did reproduce the overheads for the 4 cases.

I see, but I wouldn't go so far as to call these "overheads" since the different parsers do different things, so this is an apples-to-oranges comparison. My suggestion would be to use a parser which extracts the data you require (and not more) so you don't pay for something you do not need.
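
As a minimal sketch (not taken from the grammars in this thread): if the 8 bytes themselves are never used afterwards, a skip field avoids materializing a value for them at all:

type PDU = unit {
    # Consume 8 bytes without building up a value for them.
    : skip bytes &size=8;
};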

The only reason to suspect waitForInput is because nothing else in the generated C++ code seemed to jump out and attract attention.

Yepp, the work is spread pretty evenly among the things which need to happen here to parse the given grammars, and any differences in waitForInput are due to different things being parsed (e.g., you would see more relative time spent in it if you are parsing eight fields as opposed to one field). This is a fundamental limitation of microbenchmarks like the ones here and IMO makes them less useful for clearly identifying bottlenecks in real grammars (sometimes they are interesting starting points though).

Is there a reliable way to determine the overhead of stack switching to really confirm that waitForInput is not the bottleneck?

The most expensive part of stack switching (at least with the default fiber behavior) is saving and restoring the stacks. This happens in StackBuffer::save and StackBuffer::restore. For the large private grammar you are looking at, this makes up about 10% of the parsing work.
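
One way to quantify this (a sketch of one approach, not the exact commands used above; it assumes a Linux perf setup and a build with debug symbols) is to profile one of the spicy-driver runs from earlier and see how much time lands in those two functions:

$ perf record -g -- ./bin/spicy-driver -f input foo.hlto -p foo::P2
$ perf report    # look for StackBuffer::save / StackBuffer::restore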

I am closing this issue for now since I don't think anything can be done about the differences you reported.