Canop/broot

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT

zamazan4ik opened this issue · 7 comments

Hi!

Recently I did many Profile-Guided Optimization (PGO) benchmarks on multiple projects - the results are available here. Here you can find different applications from different domains that were accelerated with PGO: virtual machines (like QEMU and CrosVM), compilers, gRPC workloads, benchmark tools, databases, and much more. That's why I think it's worth trying to apply PGO to Broot. I ran some benchmarks and want to share my results.

Test environment

  • Hardware: Macbook M1 Pro
  • OS: macOS 13.4 Ventura
  • Rust: 1.72
  • Broot version: the latest main branch (1b5c1838b3a533cab390def547ef5cfb892c47f3 commit)

Benchmark

As an evaluation and training set, I used these benchmarks https://github.com/Canop/broot/tree/main/benches via cargo bench. PGO was also trained on these benchmarks with cargo pgo bench (see the link below to this awesome tool). All measurements were done with the same background noise (as much as I can guarantee on this OS).
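For reference, the cargo-pgo workflow used for this training boils down to a couple of commands. This is a sketch based on the tool's documented subcommands; exact flags may differ between cargo-pgo versions:

```shell
# One-time setup: cargo-pgo needs the llvm-tools-preview rustup component.
cargo install cargo-pgo
rustup component add llvm-tools-preview

# 1. Build an instrumented binary and run the benchmarks to collect profiles.
cargo pgo bench

# 2. Rebuild with the collected profiles applied and re-run the benchmarks.
cargo pgo optimize bench
```

The instrumented run is expected to be noticeably slower; only the second, optimized run is comparable to a plain Release build.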

Results

The results are presented in the cargo bench format. Since I do not know how to copy these fancy tables properly, I will attach screenshots instead (sorry for that).

Release run:
(screenshot attached)

Instrumented compared to Release (here you can see how much slower the benchmarks are with instrumentation enabled):
(screenshot attached)

Then I ran cargo bench once again with the Release version to reset the benchmark state to the Release baseline.

Release + PGO optimized compared to Release:
(screenshot attached)

As you can see, PGO helps achieve better performance, at least in the benchmarks provided by the project.

Possible future steps

I can suggest the following things to do:

  • Evaluate PGO's applicability to the Broot binary itself (instead of the benchmarks).
  • If PGO helps to achieve better performance - add a note to Broot's documentation (the README file?) about that. In this case, users and maintainers will be aware of another optimization opportunity for Broot.
  • Provide PGO integration in the build scripts. This would help users and maintainers easily apply PGO to their own workloads.
  • Optimize prebuilt Broot binaries with PGO.
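As a sketch of what an opt-in PGO build could look like without any extra tooling, using rustc's built-in PGO flags (the profile directory and the workload command below are placeholders, not something from the Broot repository):

```shell
# Step 1: build an instrumented release binary.
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# Step 2: run a representative workload; each run emits .profraw files
# into /tmp/pgo-data. (Placeholder workload - replace with real usage.)
./target/release/broot --help

# Step 3: merge the raw profiles. llvm-profdata ships with the
# llvm-tools-preview rustup component.
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# Step 4: rebuild, letting rustc optimize with the merged profile.
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```

The `-Cprofile-generate`/`-Cprofile-use` flags are the stable rustc interface to LLVM's instrumentation-based PGO; cargo-pgo automates these same steps.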

After PGO, I can suggest evaluating LLVM BOLT as an additional optimization step.
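cargo-pgo also wraps BOLT. A sketch of a combined PGO+BOLT build, with subcommand names taken from cargo-pgo's documentation (they may change between versions, and BOLT itself has platform limitations):

```shell
# Build a BOLT-instrumented binary on top of a PGO-optimized build.
cargo pgo bolt build --with-pgo

# Run a representative workload with the instrumented binary to gather
# BOLT profiles, then produce the final PGO+BOLT-optimized binary.
cargo pgo bolt optimize --with-pgo
```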

For the Rust projects, I recommend starting with cargo-pgo.

Canop commented

Running PGO to optimize up to 12% some specific tasks doesn't seem worth the potential degradation of non optimized ones, which is inherent to PGO.

Running PGO to optimize up to 12% some specific tasks doesn't seem worth the potential degradation of non optimized ones, which is inherent to PGO.

Right now there is no proof that PGO will degrade scenarios that are important for users. You can check how PGO is integrated into other projects like Clang, Rustc, Python, and others (more integrations are here - https://github.com/zamazan4ik/awesome-pgo#pgo-showcases). If you have good coverage of all scenarios, you can collect multiple profiles, merge them, and then PGO will optimize for all of them. Even such a generic merged profile can help optimize the program in general (e.g. Rustc does exactly the same thing in its PGO pipeline).
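Merging profiles collected from several scenarios is a one-liner with llvm-profdata (the file names below are placeholders for whatever workloads get profiled):

```shell
# Combine .profraw files from different workloads into a single profile;
# rustc then consumes it via -Cprofile-use.
llvm-profdata merge -o merged.profdata scenario-a.profraw scenario-b.profraw
RUSTFLAGS="-Cprofile-use=$(pwd)/merged.profdata" cargo build --release
```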

If you think PGO profiles from cargo bench are not good enough - that's fair enough. That's why I suggested testing PGO directly on Broot's binary instead of the benchmarks.

If we are not able to collect generic-enough profiles - okay. We can perform PGO benchmarks, document the results in the documentation, and integrate a PGO build mode into the build scripts. Then users/maintainers can decide on their own whether they want to optimize Broot with PGO or not.

I don't see why broot should use an experimental technology, hard to maintain for one dev, for a few hypothetical speedups, when broot is more likely to be bottlenecked by the OS/hardware anyway. I wonder whether the benchmark tests even involve any IO.

Canop commented

I don't see why broot should use an experimental technology

Same feeling. I've never seen impressive results in my tests of PGO and it never seemed worth the pain. So I'm not going to invest here unless I see new results.

I don't see why broot should use an experimental technology

It depends on your definition of "experimental" :) If "experimental" means "new to Broot" - I agree. But PGO itself is not a novel technique at all. E.g. PGO was implemented in GCC somewhere around version 4.5 (I am too young to remember such releases in practice), and Clang has also supported PGO for a long time. I cannot quickly find when PGO was implemented in Rustc, but Rust's implementation fully relies on the LLVM one. From the usage perspective, PGO has been used as an optimization technique for years (good examples are all Chromium-based browsers, Clang/GCC/Rustc themselves, and CPython). From the companies' perspective, Google and Facebook are major users of PGO. E.g. Google uses PGO (in sampling mode aka AutoFDO, but that's just an implementation detail). You can read about Google's experience here. So I do not agree that PGO is an experimental technology across the industry, but I agree that PGO adoption overall is lower compared to the "-O3" and "LTO" optimization options.

Update: forgot to mention LLVM BOLT. I agree to consider this technology "experimental", even if Facebook/Meta has huge experience with deploying it on their servers. According to my tests, there are a lot of caveats with BOLT in practice, like bugs, ridiculous memory consumption, etc.

hard to maintain for one dev

Of course, I cannot estimate, from your side, how hard the maintenance of this would be. You can see how PGO is integrated into other projects here. You have multiple options for integrating PGO into a project, with different maintenance costs:

  • Test and document PGO's effects on Broot's performance. This usually needs to be done once and never (or veeeeeeery rarely) touched again.
  • Integrate building the project with PGO as an opt-in feature. This kind of integration does not require regular maintenance either.
  • Add PGO profile generation and PGO optimization to the CI. This way is usually a bit harder. How hard is it? Well, from my experience, the sample workload does not change frequently, so you don't need to touch these scripts regularly. E.g. Pydantic-core uses this approach: pydantic/pydantic-core#741

for few hypothetical speed when anyway broot is more likely to be bottleneck by OS/hardware anyway

That's why I showed you the PGO improvement results on the Broot benchmarks :) If you think these improvements are not important - okay, but in that case I do not understand why you have such benchmarks :D If you have CPU-bound benchmarks for something, that means they are important to you. Also, here you can see PGO improvements on other projects; some of them seem IO-bound at first but get interesting improvements from PGO, like the hurl results.

However, I agree that testing PGO directly on the Broot binary itself would be more interesting to see. I haven't done it yet. The issue is just an idea of how to (possibly) improve the performance - maybe someone will find this idea worth trying.

I've never seen impressive results in my tests of PGO

In general or in Broot? If we are talking in general, I have all my PGO results for real-life applications here. For every showcase you can follow the link and read about PGO's effects on the software's performance. Sometimes the gains are large (usually around 20% in compiler-like workloads), sometimes much smaller (like with DragonflyDB).

If we are talking about Broot: yes, right now we only see improvements in the project's benchmarks, not in Broot's performance directly - that needs to be tested as well.

Canop commented

I decided not to pursue this ATM. This might be revised later.