Consider using LTO + PGO + Bolt

Question

zamazan4ik opened this issue 2 years ago · 7 comments

Hi!

YDB does not currently support building with more advanced optimization techniques such as PGO and BOLT. These tools are seeing increasing adoption in the community as a way to optimize programs further, and there is a high chance they would give YDB even more performance "for free".

I suggest at least experimenting with an LTO + PGO + BOLT pipeline (or any combination of the three) and testing whether it improves the project's performance. If it does, it would be awesome to have prebuilt binaries with these optimizations enabled out of the box. It would also help users if this functionality were integrated into the build scripts, so that they could tune their own binaries for their own workloads.

There are also some caveats to consider:

- Increased build times
- BOLT may still be unstable (or even broken) on some architectures

Links:

- ScyllaDB results: scylladb/scylladb#10808
- Vector results: vectordotdev/vector#15631
- Rust experience with LTO + PGO + BOLT: https://kobzol.github.io/rust/rustc/2022/10/27/speeding-rustc-without-changing-its-code.html
- Good chance to optimize build times of the project with PGO too: scylladb/scylladb#10985

Answer 1 · 2023-03-25T23:11:01.000Z

I did some performance experiments on my local machine.

My setup:

- OS: Fedora 37
- Linux kernel: 6.2.7
- Compiler: clang-15 from Fedora packages (clang 15.0.7); I patched a few sources to support this compiler
- Hardware: Ryzen 9 5900X, 32 GiB RAM, SSD

For benchmarking and profile generation, I used the KqpLoad actor (https://ydb.tech/en/docs/development/load-actors-kqp), running it multiple times for 300 seconds each with all other parameters left at their defaults. The YDB setup was a local one with RAM storage, as described here: https://ydb.tech/en/docs/getting_started/self_hosted/ydb_local but with my own ydbd binaries.

I did the following things:

- Built the usual release build and benchmarked it
- Built an instrumented build, ran the same benchmark on it, and then recompiled with Clang PGO using the generated profiles
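
The two steps above follow Clang's instrumentation-based PGO flow. Below is a minimal sketch of the commands involved, assuming a plain clang++ invocation rather than YDB's actual build system. The source file, binary names, and profile paths are hypothetical placeholders, and the script only prints the commands instead of running them.

```python
import shlex

# Hypothetical placeholders: YDB's real build goes through its own build
# system, so these names only illustrate the order of the steps.
SOURCES = ["main.cpp"]
PROFILE_DIR = "pgo-profiles"
MERGED_PROFILE = "app.profdata"


def pgo_commands(sources, profile_dir, merged_profile):
    """Return the command lines for one Clang instrumentation-based PGO cycle."""
    base = ["clang++", "-O2"]
    return [
        # 1. Instrumented build: running it writes .profraw files into profile_dir.
        base + [f"-fprofile-generate={profile_dir}", *sources, "-o", "app-instrumented"],
        # (Between steps 1 and 2, run the benchmark workload on app-instrumented.)
        # 2. Merge the raw profiles into a single indexed profile.
        ["llvm-profdata", "merge", f"-output={merged_profile}", profile_dir],
        # 3. Optimized build that consumes the merged profile.
        base + [f"-fprofile-use={merged_profile}", *sources, "-o", "app-pgo"],
    ]


if __name__ == "__main__":
    for cmd in pgo_commands(SOURCES, PROFILE_DIR, MERGED_PROFILE):
        print(shlex.join(cmd))
```

The workload run between the instrumented build and the merge step determines what the compiler optimizes for, so it should resemble the load the final binary is expected to serve.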

The results are the following:

- Usual release build: 28k TPS
- PGO-optimized build with the same release flags: 35k TPS

I also tried to apply BOLT, but perf2bolt consumed more than 32 GiB of RAM on the ydbd binary, so it was OOM-killed :(

Additional notes regarding instrumentation-based PGO: while generating profiles with the instrumented ydbd binary via KqpLoadActor, I hit a strange error, possibly caused by hardcoded deadlines (see https://github.com/ydb-platform/ydb/blob/main/ydb/core/load_test/kqp.cpp#L332). Since instrumented binaries are much slower, some deadlines need to be adjusted. For my local benchmarking, I simply commented out these deadlines, and the profile was generated successfully. It would probably be better to be able to configure the timeout externally, without modifying the code.

Answer 2 · 2023-03-27T00:15:36.000Z

Well, I managed to run BOLT with some "magic" options (details are here: llvm/llvm-project#61711).
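
For context, here is a sketch of the generic sampling-based BOLT flow as described in the BOLT README. It is not the set of "magic" options mentioned above (those are in the linked issue). The binary path and workload arguments are hypothetical placeholders, and the script only prints the commands. BOLT's documentation also recommends linking the input binary with relocations (`--emit-relocs`) to get the most out of function reordering.

```python
import shlex


def bolt_commands(binary, workload_args):
    """Return the command lines for a generic perf + BOLT optimization cycle."""
    perf_data = "perf.data"      # raw perf samples (placeholder name)
    bolt_profile = "perf.fdata"  # profile in BOLT's own format (placeholder name)
    return [
        # 1. Sample the running binary; -j requests branch records, which
        #    needs hardware and kernel support.
        ["perf", "record", "-e", "cycles:u", "-j", "any,u",
         "-o", perf_data, "--", binary, *workload_args],
        # 2. Convert the perf samples into a BOLT profile.
        ["perf2bolt", "-p", perf_data, "-o", bolt_profile, binary],
        # 3. Rewrite the binary with a profile-driven code layout.
        ["llvm-bolt", binary, "-o", binary + ".bolt", f"-data={bolt_profile}",
         "-reorder-blocks=ext-tsp", "-reorder-functions=hfsort",
         "-split-functions", "-split-all-cold", "-dyno-stats"],
    ]


if __name__ == "__main__":
    # "./app-pgo" and its argument are illustrative, not real ydbd options.
    for cmd in bolt_commands("./app-pgo", ["--config", "bench.yaml"]):
        print(shlex.join(cmd))
```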

As expected, BOLT didn't provide a significant performance boost on top of PGO, but I still see a measurable improvement:

- PGO: 35k TPS
- PGO + BOLT: 37k TPS
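
To put the throughput figures reported in this thread in relative terms, the arithmetic can be worked through as below. The TPS values are the rounded numbers quoted above, and the helper name is only illustrative.

```python
def relative_gain(before_tps, after_tps):
    """Return the throughput improvement of after over before, in percent."""
    return (after_tps - before_tps) / before_tps * 100.0


# Rounded figures from the measurements in this thread.
RELEASE, PGO, PGO_BOLT = 28_000, 35_000, 37_000

if __name__ == "__main__":
    print(f"PGO vs release:        {relative_gain(RELEASE, PGO):.1f}%")
    print(f"PGO + BOLT vs PGO:     {relative_gain(PGO, PGO_BOLT):.1f}%")
    print(f"PGO + BOLT vs release: {relative_gain(RELEASE, PGO_BOLT):.1f}%")
```

This works out to roughly 25% for PGO over the release build, about 5.7% for BOLT on top of PGO, and about 32% for the combined pipeline over the release build, all on this single KqpLoad workload.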

I think Propeller (an alternative approach, similar to BOLT but from Google) could deliver almost the same numbers. I tried to test YDB with Propeller, but it requires the latest Clang compiler from the main branch, and YDB has a bunch of compilation errors with it. Right now I lack the motivation to fix them. Maybe one day I will test it too :)

Answer 3 · 2023-04-01T11:06:24.000Z

Hi Alexander Zaitsev, thank you very much for sharing this excellent idea and running the initial experiments. One of our engineers has confirmed your results and is working further on the integration details. We will be back soon, once we have collected more data and understand the best way to use it.

Answer 4 · 2023-08-27T03:08:26.000Z

@eivanov89 do you have any updates regarding PGO? If you can confirm the results and find them useful, I suggest adding a note to the YDB documentation about tuning YDB with PGO. Here are examples from other projects of what such documentation can look like:

- GCC: official docs, section "Building with profile feedback" (AutoFDO builds are supported too)
- Clang:
  - https://llvm.org/docs/HowToBuildWithPGO.html
  - https://llvm.org/docs/AdvancedBuilds.html
- ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
- Databend: https://databend.rs/doc/contributing/pgo
- Vector: https://vector.dev/docs/administration/tuning/pgo/
- Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/

Having this kind of information in the official documentation makes optimization opportunities more visible to the end users and maintainers.

Answer 5 · 2023-08-31T14:43:08.000Z

Hi @zamazan4ik, sorry for the delay. We have some issues with our internal tools and build. We hope to solve them soon. But if that fails, we will consider applying this to the GitHub build only.

Answer 6 · 2023-08-31T14:45:14.000Z

> But if that fails, we will consider applying this to the GitHub build only.

Understood. If you confirm the results above, I suggest adding a note about PGO to the YDB documentation. So the users who build YDB binaries on their own will be able to estimate performance benefits from PGO on YDB and optimize their YDB builds too.

Answer 7 · 2023-09-11T09:36:57.000Z

> So the users who build YDB binaries on their own will be able to estimate performance benefits from PGO on YDB and optimize their YDB builds too.

The tests that we have both used to evaluate PGO are too narrow, imho. We're going to try YCSB and TPC-C to check whether real benchmarks benefit in the same way as the microbenchmarks we have used so far.