llvm/llvm-project

PGO optimization results for LLVM-based projects

zamazan4ik opened this issue · 11 comments

Hi!

Right now the LLVM build scripts support PGO only for Clang. I want to share PGO results for other LLVM-based projects.

Here are my results from applying PGO to clangd (version: current main branch). In my local tests on an AMD 5900X / 48 GiB RAM / Fedora 38, with Clang 16 used to build clangd, clangd in Release mode without PGO finishes indexing the llvm-project sources in ~10 minutes (9m55s), and clangd in Release mode with PGO (-fprofile-instr-generate/-fprofile-instr-use) finishes in ~8 minutes (7m50s - 7m55s). Tests were performed multiple times, with the clangd cache reset between runs, on the latest main branch.

Compilation options: CC=clang CXX=clang++ cmake -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/zamazan4ik/open_source/install_clangd -DLLVM_INCLUDE_BENCHMARKS=1 -DCMAKE_POSITION_INDEPENDENT_CODE=1 -DLLVM_USE_LINKER=lld -G "Ninja" ../llvm-project/llvm (PGO mode just additionally sets the -fprofile-instr-generate/-fprofile-instr-use options).
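For readers who have not done this before, the instrument / train / merge / rebuild cycle behind these numbers can be sketched roughly as below. This is not the exact script used for the measurements: the build directory names, PROFILE_DIR, and TRAIN_CMD (whatever drives the instrumented clangd over a training workload) are hypothetical placeholders. The sketch is a dry run by default and only prints the commands.

```shell
#!/usr/bin/env bash
# Dry-run sketch of an instrumentation-based PGO cycle for clangd.
# Commands are printed, not executed, unless EXECUTE=1 is set.
# SRC, PROFILE_DIR, TRAIN_CMD and the build-* directories are hypothetical.
SRC="${SRC:-../llvm-project/llvm}"
PROFILE_DIR="${PROFILE_DIR:-$PWD/profiles}"
TRAIN_CMD="${TRAIN_CMD:-./train-clangd.sh}"   # assumed user-provided training script

run() {
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

configure() {  # $1 = build directory, $2 = extra C/C++ compiler flag
  run cmake -S "$SRC" -B "$1" -G Ninja \
    -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
    -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" \
    -DCMAKE_BUILD_TYPE=Release -DLLVM_USE_LINKER=lld \
    -DCMAKE_C_FLAGS="$2" -DCMAKE_CXX_FLAGS="$2"
}

pgo_cycle() {
  run mkdir -p "$PROFILE_DIR" &&
  # 1. Build an instrumented clangd.
  configure build-instr "-fprofile-instr-generate" &&
  run ninja -C build-instr clangd &&
  # 2. Train: each process writes its own raw profile (%p expands to the PID).
  run env LLVM_PROFILE_FILE="$PROFILE_DIR/clangd-%p.profraw" $TRAIN_CMD &&
  # 3. Merge the raw profiles into one indexed profile.
  run llvm-profdata merge -output="$PROFILE_DIR/clangd.profdata" "$PROFILE_DIR"/*.profraw &&
  # 4. Rebuild clangd using the merged profile.
  configure build-pgo "-fprofile-instr-use=$PROFILE_DIR/clangd.profdata" &&
  run ninja -C build-pgo clangd
}

pgo_cycle
```

Setting EXECUTE=1 and pointing TRAIN_CMD at a real training script would run the same steps for real; the llvm-profdata used for merging should come from the same LLVM version as the compiler.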

I think the results are good enough to justify adding PGO support for clangd to the repository (as is already done for Clang itself). Another article about applying PGO to clangd can be found here: JetBrains blog.

Additionally, I want to note that clangd in instrumentation mode is far too slow, so I only waited for it to index ~200 files from the LLVM repo. I think that's enough (and the benchmark confirms it).


I think similar projects like Clang-Tidy and Clang Static Analyzer should also get a performance boost from PGO.

On the same hardware/software setup I got the following results for optimizing Clang-Tidy (version: current main branch). As a benchmark workload, I chose to scan the BOLT sources with all Clang-Tidy checks (except alpha). Measurements were done with the time utility.

The results are the following:

  • Release: ./run-clang-tidy -checks="*" -p /home/zamazan4ik/open_source/build_bolt 5319,09s user 49,67s system 2142% cpu 4:10,58 total
  • Release + PGO: ./run-clang-tidy -checks="*" -p /home/zamazan4ik/open_source/build_bolt 4691,02s user 53,03s system 2041% cpu 3:52,43 total
  • (just for the history) Instrumentation mode (-fprofile-instr-generate): ./run-clang-tidy -checks="*" -p /home/zamazan4ik/open_source/build_bolt 12124,95s user 65,60s system 2057% cpu 9:52,44 total
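To make the gain easier to compare across tools, the timings above can be turned into percentages. The small sketch below converts the comma-decimal "M:SS,ff" wall-clock format printed by the time utility into seconds and computes the relative reduction; the helper names to_seconds and pct_reduction are made up for this example.

```shell
# Convert a "M:SS,ff" (or "H:MM:SS,ff") wall-clock time with a comma decimal
# separator into seconds.
to_seconds() {
  echo "$1" | tr ',' '.' | LC_ALL=C awk -F: \
    '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}

# Percentage reduction from a baseline time ($1) to an optimized time ($2).
pct_reduction() {
  LC_ALL=C awk -v base="$1" -v opt="$2" \
    'BEGIN { printf "%.1f\n", (base - opt) / base * 100 }'
}

release=$(to_seconds "4:10,58")   # Release wall time
pgo=$(to_seconds "3:52,43")       # Release + PGO wall time
echo "wall: ${release}s -> ${pgo}s, $(pct_reduction "$release" "$pgo")% less"
echo "user: $(pct_reduction 5319.09 4691.02)% less"
```

For the Clang-Tidy run this gives about 7.2% less wall-clock time and about 11.8% less user CPU time with PGO.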

FWIW, I don't think this is a clangd-specific issue, but rather a release issue, and it should probably be discussed as an RFC on Discourse rather than here. People can already use whatever profiling they want when building LLVM, and I believe the proposal here is to have some PGO in the default LLVM releases and/or in the way distros build LLVM-related packages.
So this extra complexity should be vetted by the people who maintain those build systems/scripts.


Regarding clangd, I believe indexing is not the most latency-sensitive workflow we have. It's done in the background, on idle cores, and despite being quite useful for a good experience, having a full index of the codebase as soon as possible is not critical to clangd functionality.

The latency-sensitive workflows are rather the interactive ones, like code completion/signature help and main-file AST builds (similar to indexing, but with a preamble).

So I think having some profiling based on those interactions would be better, but I guess any profiles are fine as long as they don't clearly regress the performance of the latency-sensitive interactions I mentioned above.

> FWIW, I don't think this is a clangd-specific issue, but rather a release issue, and it should probably be discussed as an RFC on Discourse rather than here. People can already use whatever profiling they want when building LLVM, and I believe the proposal here is to have some PGO in the default LLVM releases and/or in the way distros build LLVM-related packages.
> So this extra complexity should be vetted by the people who maintain those build systems/scripts.

Right now I am talking about bringing PGO support for other LLVM projects to the same level that Clang has now (CMake-specific scripts in the LLVM repo). I am not talking about builds in different Linux distributions; that's a separate discussion for every distro, since each needs to choose its own balance between "performance improvements" and "maintainability costs".

> Regarding clangd, I believe indexing is not the most latency-sensitive workflow we have. It's done in the background, on idle cores, and despite being quite useful for a good experience, having a full index of the codebase as soon as possible is not critical to clangd functionality.

Anyway, it considerably improves the experience of checking out fairly large codebases on local machines and of switching between fast-evolving branches (there, many files have to be reindexed, so the UX improves in that case as well).

> The latency-sensitive workflows are rather the interactive ones, like code completion/signature help and main-file AST builds (similar to indexing, but with a preamble).

I didn't perform such benchmarks, but JetBrains did (warning: on Windows): link.

Another LLVM project - LLD.

As a test project, I chose ClickHouse. It has a large binary (2.3 GiB in Release mode, unstripped). As the benchmark, I link ClickHouse in Release mode with ThinLTO. For this test case, I have the following results (lld version: current main branch):

  • LLD in Release mode: 6915,33s user 43,37s system 902% cpu 12:50,84 total
  • LLD in Release mode + PGO: 6084,39s user 47,80s system 901% cpu 11:19,96 total
  • (just for history) LLD in the Instrumentation mode links ClickHouse in 13 HOURS. So yes, LLD in the Instrumentation mode is quite slow.

All other flags are the same. Hardware is the same as above. PGO mode: -fprofile-instr-generate/-fprofile-instr-use. Note that a partial LLD profile was used for training: I didn't want to wait 13 hours again, so I waited 1-2 hours, dumped the profile from LLD, and applied it in the optimization phase.

One more - clang-format.

The results are shown in the time utility format. Command used to measure performance: time find llvm_project/llvm -iname \*.h -o -iname \*.cpp | xargs clang-format --style=Google -i. The same command was used for the instrumentation run too.

  • Release: 48,64s user 0,97s system 98% cpu 50,377 total
  • Release + PGO: 40,29s user 0,99s system 98% cpu 42,023 total

All other flags are the same. Hardware is the same as above. PGO mode: -fprofile-instr-generate/-fprofile-instr-use.

Another LLVM project - LLDB.

As a benchmark, I used the bt all command on a ClickHouse Release binary. The procedure was as follows:

  • Run lldb clickhouse server start
  • Enter run command
  • Wait for the ClickHouse initialization (something like 10 secs)
  • Stop ClickHouse with Ctrl+C
  • Run bt all and measure time

With Release LLDB, bt all took 8m50s on average; with Release + PGO, 8m32s. Not a huge improvement, but still. An interesting detail: lldb from the Fedora repo performs bt all on the same binary almost instantly. Maybe it uses different defaults or something like that; I don't know yet.

LLDB was built locally with CC=clang CXX=clang++ cmake -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;lldb" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/zamazan4ik/open_source/install_lldb_optimized -DCMAKE_CXX_FLAGS="-fprofile-instr-use=/home/zamazan4ik/open_source/llvm_profiles/lldb.profdata" -DCMAKE_C_FLAGS="-fprofile-instr-use=/home/zamazan4ik/open_source/llvm_profiles/lldb.profdata" -DCMAKE_POSITION_INDEPENDENT_CODE=1 -DLLVM_USE_LINKER=lld -G "Ninja" ../llvm-project/llvm

If you have a better idea for a measurable workload for LLDB, feel free to share it with me.

@kadircet based on my benchmarks, can we update the page https://llvm.org/docs/HowToBuildWithPGO.html with results about the PGO benefits for other LLVM parts, not just Clang and compile time? I think it's a good thing for users to know as well.

Another suggestion: add information about the effects of LLVM BOLT on LLVM projects as well (e.g. based on this benchmark).

Why am I asking to update the documentation? Because it's much easier for users to find the information there than to search through GitHub issues.

Would be interesting to see numbers for AutoFDO.
It's not feasible to run an instrumented clangd build in production but with sampling it might give a pretty accurate profile after a while.

> Would be interesting to see numbers for AutoFDO.

Agreed! But right now I have no hardware with LBR/BRS support to test it (AutoFDO needs it). According to several sources, AutoFDO results should be almost the same as with instrumentation. However, according to the Google papers, AutoFDO is a little less efficient than instrumentation in terms of the optimizations it enables.
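For reference, the sampling-based flow being discussed looks roughly like the sketch below, which follows the general shape described in the Clang documentation. It was not run for this thread (no suitable hardware, as noted above); app.cc, app, and app.prof are hypothetical names, and the script only prints the commands unless EXECUTE=1 is set.

```shell
# Dry-run sketch of a sampling-based (AutoFDO) cycle.
# app.cc / app / app.prof are hypothetical; nothing is executed unless EXECUTE=1.
run() {
  if [ "${EXECUTE:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

autofdo_cycle() {
  # 1. Build an optimized binary with line-table debug info (no instrumentation).
  run clang++ -O2 -gline-tables-only app.cc -o app &&
  # 2. Sample the workload with branch records (needs LBR-like hardware support).
  run perf record -b -o perf.data ./app &&
  # 3. Convert the perf data into an LLVM sample profile (tool from google/autofdo).
  run create_llvm_prof --binary=./app --profile=perf.data --out=app.prof &&
  # 4. Rebuild using the sample profile.
  run clang++ -O2 -gline-tables-only -fprofile-sample-use=app.prof app.cc -o app
}

autofdo_cycle
```

The practical difference from the instrumented flow is step 2: the profiled binary is an ordinary optimized build, so the large slowdowns seen in instrumentation mode do not apply.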

> It's not feasible to run an instrumented clangd build in production but with sampling it might give a pretty accurate profile after a while.

I would argue with that statement. Whether it is feasible depends entirely on the use case. Running an instrumented clangd once on some workload, collecting profiles, and then PGO-optimizing clangd is totally fine if we update clangd rarely (so we do not need to run the instrumented clangd often to refresh the profiles). There are several strategies to mitigate the drawbacks of instrumentation in production, such as training only on a small but sufficiently representative subset of the workload.

One more point against AutoFDO is the quality of the AutoFDO converter itself. You can find multiple issues in the AutoFDO upstream, like google/autofdo#179 or google/autofdo#162 (and others). That could also cause problems.

If we are talking about "PGO at scale" (as it's used at Google), there is another problem: lack of tooling. In the Google paper, almost all the tooling around their PGO approach is closed source (profile collectors, storage, etc.), and there is no open-source alternative yet.

I also believe that the AutoFDO approach is a more practical way to apply PGO in production, but right now it has many limitations that should be carefully considered. That's why I am saying that instrumented PGO is still a completely fine way of doing PGO in practice.