zeek/spicy

string constructor performance

Closed this issue · 9 comments

I'll attach a sample, but I discovered that loading a pre-computed hash table got really slow in Spicy.
After some false trails, I think it's actually constructing strings that is the problem. Compare the two attached .spicy files in the zip, in the next comment:
one has 10k empty strings (fast), the other 10k one-byte strings (slow), for me
on 7.0.0-dev.254. On 6.0.3 and 6.2.0, however, they both take 28 s.

I tried to solidify this further and got some confusing results, so I will add more over time. I verified I was on the right build and path.
Retried with 10,000-element vectors; that gives 28 s vs. 14 min, so not as dramatic, but the same effect I guess.
time spicyz -o ex ex10000.spicy;
real 14m57.615s
user 14m17.929s
sys 0m42.904s
ex2.zip

time spicyz -o ex ex-null10000.spicy

real 0m27.564s
user 0m32.288s
sys 0m1.792s

My originally discovered issue took place in an "import" — I just checked an import version and the effect is the same (not worse). Just noting that to rule it out.

Note I was speaking very loosely in saying "string" above, when really it was bytestrings.

As I thought it would be interesting, I morphed it into a string test. On the plus side it wasn't in minutes, but clearly there is some interesting optimization happening:
ex-string-10000: 38 s
ex-string-null-10000: 7 s

So the one-character case dropped down to almost the null case for bytestrings, but the actual null case plummeted 4x as well, so it was actually 6x longer with one character in strings. These results are consistent with 6.2 (same delta). So a potential optimization point, but not the problem... it's something about bytestrings.

Since this issue is strictly about compilation performance, it has little to do with the performance of an actual constructor (constructors are invoked at runtime, not at compile time). Instead, the issue appears to be that for big container literals we generate C++ code which is (very) expensive to compile (this depends on the exact C++ compiler used, but seems consistent). In particular, for Spicy code like this

global xs = vector(1, 2, 3, 4, 5);

we generate C++ roughly equivalent to this

std::vector<int> xs{1, 2, 3, 4, 5};

The elements are passed as a std::initializer_list to the std::vector constructor, which for huge containers triggers the C++ compiler performance edge case seen here (we are talking about vectors with 10,000 elements). A similar issue exists for set and map literals.

I opened #1744 to generate C++ code which seems to be easier to digest for the compiler.

For the time being you could unroll the initialization of "huge" containers yourself, e.g., for a vector (set and map can follow a similar pattern):

global xs: vector<uint8>; # Construct empty `vector`.

# Calling `reserve` for a `vector` before inserting might improve runtime performance.
xs.reserve(5);

# Add elements.
xs.push_back(1);
xs.push_back(2);
xs.push_back(3);
xs.push_back(4);
xs.push_back(5);

Moving this to a dedicated function might improve the readability and maintainability of your code (but still does not allow initializing a const, see #1745).

vpax commented

I'd be surprised if this boils down to large vector initializers. I switched to using those for -O gen-C++ in order to speed up compilation, and a micro-benchmark I just tried on my Mac has clang++ -O3 compiling a vector initialized to a million elements in under a second.

vpax commented

I should add: in particular, the original code was like the unrolling you describe, and that took many minutes to compile, whereas using vectors is 100x faster.

Also, I think the string result (vs. bytestring) argues against it being the vector. I had another idea for benchmarking; will try later today.

> I'd be surprised if this boils down to large vector initializers. I switched to using those for -O gen-C++ in order to speed up compilation speed, and a micro-benchmark I just tried on my Mac has clang++ -O3 compiling a vector initialized to a million elements in under a second.

I also benchmarked to be sure, so maybe I am not looking at the right thing. I compared code using an initializer list of a non-trivial object (in Smoot's case a bytes, but for benchmarking a std::string)

#include <string>
#include <vector>

std::vector<std::string> f() {
    auto xs = std::vector<std::string>{
        " ", // Repeated `n` times.
             // ...
    };

    return xs;
}

int main() { f(); }

and another version using push_back

    auto xs = std::vector<std::string>();
    xs.push_back(" "); // Repeated `n` times.
    // ...

I then compiled this with clang++ --std=c++17 -O3 ... and measured the times. I see quadratic behavior when calling the constructor taking a std::initializer_list, and roughly flat behavior for push_back (likely the compile time of those operations just disappears into the background).

[Screenshot from 2024-05-15: plot of the measured compile times over n.]

If the std::vector contains a trivial type like int, using an initializer or not makes no noticeable difference.

All this is roughly equivalent to what is happening in the reproducer; #1744 switches from initializers to insertion and I see improved performance.

vpax commented

> If the std::vector contained a trivial type like int using an initializer or not makes no noticeable difference.

That's presumably the difference. Your example above used ints, so that's what I timed (and what the -O gen-C++ code uses, where the ints serve for table-driven initialization of more complex objects using much less code).