vectordotdev/vector

PGO applicability to Vector

zamazan4ik opened this issue · 13 comments

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

TL;DR: With PGO, Vector got a boost from 300-310k events/s to 350-370k events/s!

Hi!

I am a big fan of PGO, so I've tried to use PGO with Vector, and I want to share my current results. My hypothesis is the following: even for programs built with LTO, PGO can bring HUGE benefits, and I decided to test it. From my experience, PGO works especially well on large codebases with some CPU-hot parts. Vector looks like a really good fit.

Test scenario

  1. Read a huge file with some logs
  2. Parse them
  3. Pass them to the blackhole.

This test scenario is completely real-life (except the blackhole, ofc :) ), and the log format and parse function are almost copied from our current prod env. We have patched the flog tool to generate our log format (a closed-source patch, sorry - I could publish it later if there is a need for it).

Example of one log entry:
<E 2296456 point.server.session 18.12 19:17:36:361298178 processCall We need to generate the solid state GB interface! (from session.cpp +713)

So Vector config is the following (toml):

[sources.in]
type = "file"
include = [ "/Users/zamazan4ik/open_source/test_vector_logs/data/*" ]
read_from = "beginning"
file_key = "file"
data_dir = "/Users/zamazan4ik/open_source/test_vector_logs"
 
[transforms.parser]
type = "remap"
inputs = [ "in" ]
source = """
.message = parse_regex!(.message, r'<(?P<level>[EWD]) (?P<thread>.+?) (?P<tag>[a-z.]+) (?P<datetime>[\\d.]+ [\\d:]*) (?P<function>[\\S]+) (?P<mess>.*) \\(from (?P<file>[\\S.]*) \\+(?P<line>\\d+)\\)')
"""
 
[sinks.out]
type = "blackhole"
inputs = [ "parser" ]

[api]
  enabled = true
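
For illustration, given the sample log entry above, the remap transform should produce roughly the following event (a sketch: the top-level file path is hypothetical, the exact event shape depends on the Vector version, and the message fields come from the named capture groups in the regex):

{
  "file": "/Users/zamazan4ik/open_source/test_vector_logs/data/example.log",
  "message": {
    "level": "E",
    "thread": "2296456",
    "tag": "point.server.session",
    "datetime": "18.12 19:17:36:361298178",
    "function": "processCall",
    "mess": "We need to generate the solid state GB interface!",
    "file": "session.cpp",
    "line": "713"
  }
}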

You could say: "The test scenario is too simple", but:

  • I specifically wanted to start with a minimal example to reduce noise from unknown factors.
  • As I said before, for us it's a completely real-life example (just replace the blackhole with smth like an elasticsearch sink).

Test setup

MacBook M1 Pro with macOS Ventura 13.1, a 6+2-core ARM CPU (AFAIK), 16 GiB RAM, and a 512 GiB SSD. Sorry, I have no Linux machine near me right now, nor a desire to test it in a Linux VM or an Asahi Linux setup. However, I am completely sure that the results will be reproducible on a "usual" Linux-based x86-64 setup.

How to build

Vector already uses fat LTO for the release build. However, a local release build and the release build on CI are different, since the local release build does not use fat LTO (it is waaaay too time-consuming). So do not forget to add the following flags to your release build (I got them from scripts/environment/release-flags.sh):

codegen-units = 1
lto = "fat"
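
These are Cargo profile settings; a minimal sketch of where they go, assuming you add them to the [profile.release] section of the top-level Cargo.toml (which is effectively what the CI release flags do):

[profile.release]
codegen-units = 1
lto = "fat"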

For performing a PGO build of Vector I've used this nice wrapper: https://github.com/Kobzol/cargo-pgo . You could do it manually if you want - I am just a little bit lazy :)

The guide is simple:

  • Install cargo-pgo.
  • Run cargo pgo build. It will build the instrumented Vector version.
  • Run Vector with a test load, e.g. cargo pgo run -- -- -c /Users/zamazan4ik/open_source/test_vector_logs/vector.toml .
  • Wait some time for it to finish. In my case, I generated a nearly 2 GiB log file, so it completes the test plan in about a minute, AFAIR.
  • Then just press Ctrl+C to interrupt Vector. The profile data will be generated somewhere in the target directory.
  • Run cargo pgo optimize. It will start the build again, this time with the generated profile data.
  • Congratulations! After the successful build you will get an LTO + PGO release Vector build (the full command sequence is sketched below).
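
Putting the steps together, the whole flow looks roughly like this (a sketch; cargo-pgo drives the RUSTFLAGS and llvm-profdata machinery under the hood):

# one-time setup: cargo-pgo needs llvm-profdata from the llvm-tools component
cargo install cargo-pgo
rustup component add llvm-tools-preview

# 1. build an instrumented Vector
cargo pgo build

# 2. run it on a representative workload, then stop it with Ctrl+C;
#    the profiles land under the target directory
cargo pgo run -- -- -c /Users/zamazan4ik/open_source/test_vector_logs/vector.toml

# 3. rebuild with the collected profiles applied
cargo pgo optimize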

Is it worth it?

Yes! At least in my case, I got a huge boost: from 300-310k events/s (according to vector top) with the default Vector release build with the LTO flags from CI, to 350-370k events/s with the same build + PGO enabled.

The comparison strategy is simple: run the LTO-only Vector binary, then the LTO + PGO Vector binary (resetting the file checkpoint between runs, ofc), measure the total time until the whole file is processed, and track metrics via vector top during execution. A rough sketch of one run is below.
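
A sketch of one measurement run, assuming the file source keeps its checkpoint state under the configured data_dir (the checkpoint path here is hypothetical and may differ between Vector versions):

# reset the file source checkpoints so the whole file is re-read from the beginning
rm -rf /Users/zamazan4ik/open_source/test_vector_logs/checkpoints   # hypothetical location under data_dir

# time a full pass over the input; watch vector top in another terminal
time ./target/release/vector -c /Users/zamazan4ik/open_source/test_vector_logs/vector.toml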

Results are stable and reproducible. I have performed multiple runs in different execution orders with the same results.

So what?

So what could we do with it?

  • At least consider adding PGO to CI. Yes, it has a LOT of caveats - a huge bump in build time, good profile preparation, profile stability between releases, and much more - but in my opinion, at least in some cases, it is definitely worth it.
  • Some users may want to "cheaply" try to boost their Vector performance. They could find this mini-guide and try it on their log pipelines. Maybe it would be a good idea to leave a note somewhere in the Vector documentation about this "advanced" option?

Possible future steps

Possible future steps for improving:

  • Perform more "mature" benchmarking, based on the current Vector benchmark infrastructure.
  • Try to play with BOLT. BOLT could also help squeeze more performance even out of an LTO + PGO build (but it's not guaranteed). This approach has drawbacks: BOLT is too unstable on some platforms, does not support some architectures, etc. But it's definitely a good tool to think about :)
  • Somehow reduce the LTO time. I guess more advanced linkers like lld or mold could help here, but I am not sure. AFAIK, mold has (or had, since this awesome linker evolves quickly) some caveats with LTO builds.

I hope this long read is at least interesting to someone :) If you have any questions about it - just ask me here or on the official Vector Discord server (zamazan4ik nickname there as well).

Hey @zamazan4ik, thanks for the extensive writeup! I know we've discussed this in the past, but it seems like it was probably internally on Slack as I didn't find any related issues. I'm also pretty sure we had looked at that cargo-pgo project 😄.

I don't quite remember why we didn't move forward (I think even with testing it), but it's interesting to see your results here.

cc @jszwedko @tobz @blt, as I'm guessing y'all were involved with that original discussion.

tobz commented

Getting a 10-15% performance boost for essentially a bit of extra CI time per release is certainly an incredibly good trade-off. I think the biggest thing would just be, as you've pointed out, doing all of the legwork to figure out what platforms we can/can't do PGO on, and creating the CI steps to do it for release/nightly builds.

I'd also be curious to figure out what workload is the best for PGO profiles. As an example: are any of our current soak/regression test workloads better/worse than what you used when locally testing? That sort of thing.

doing all of the legwork to figure out what platforms we can/can't do PGO on

Well, actually PGO is in a good state across all major platforms (Linux, macOS, Windows). Probably the best source of truth regarding the state of PGO in the Rust ecosystem is the rustc project itself, since they invest a lot of resources into optimizing the Rust compiler (e.g. Rust 1.66 enabled BOLT optimization in addition to PGO on the Linux platform).

and creating the CI steps to do it for release/nightly builds

Yes, that will be the most time-consuming and boring part, IMO. Also, do not forget about at least a 2x increase in build time (instrumentation build + a run on the test workload + the optimizing build).

I'd also be curious to figure out what workload is the best for PGO profiles.

From my experience, I would say the most beneficial targets are CPU-heavy workloads (obviously). PGO shows good results on huge programs with a lot of possible branches and a huge amount of context, where the compiler cannot make good guesses about hot/cold branching, real-life inlining, etc. That's where PGO shines. Long story short, I do not expect much performance gain on IO-bound workloads (e.g. posting to Elasticsearch), simply because the network is usually much, much slower than the CPU, and even if we got a speedup there, we would not see it in real life.

I'd also be curious to figure out what workload is the best for PGO profiles.

I think that the ideal workload for a PGO profile should exercise all the components, or at least all the component subsystems, as there would be no benefit for those components that aren't exercised. It would probably be good to see some indication of code coverage with this too, something we are also lacking.

I think that the ideal workload for a PGO profile should exercise all the components, or at least all the component subsystems, as there would be no benefit for those components that aren't exercised. It would probably be good to see some indication of code coverage with this too, something we are also lacking.

Good suggestion. I just want to add that this work could be done iteratively: add baseline loads for the components step by step. That way we could deliver PGO improvements incrementally instead of waiting until baseline profiles are prepared for all components at once.

@jszwedko do you want to mention PGO somewhere here in the documentation?

@jszwedko do you want to mention PGO somewhere here in the documentation?

That page is more about tuning the released Vector assets rather than recommendations that involve recompiling Vector.

I'd be happy to see us do this, but, as discussed above, it'll take some work.

@jszwedko I have some examples of how a PGO-oriented page could look:

I think a similar approach could be used for Vector as well - just create a page with a dedicated note about PGO and put it in the Vector documentation.

Thanks for the links @zamazan4ik ! I've come around and agree that we could add this to the docs for advanced users who are able to compile Vector themselves and run example workloads. I could see it being a subpage under https://vector.dev/docs/administration/tuning/. Feel free to open a PR if you like 🙂

I did some LTO, PGO, and BOLT benchmarks on Linux and want to share my numbers. The test scenario is exactly the same as in #15631 (comment) .

Setup

My setup is the following:

  • Fedora 38
  • Linux kernel 6.4.12
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Rustc version: 1.71.1
  • Vector: from master branch on dc665666bcd4d3487ca3684fd2fe41d4415cea52 commit

Results

Unfortunately, I didn't manage to test LTO + PGO on Linux, since on the current Rust version it's broken for as-yet-unknown reasons (see Kobzol/cargo-pgo#32 and llvm/llvm-project#57501 for further details). Hopefully this will be fixed in the future.

So I did some measurements with different LTO configurations and BOLT. The provided time is the time to complete the test scenario (process the same input file with the file source and do some heavy regex-based transforms). The results are the following:

  • Vector release with codegen-units = 1 and lto = "off": 2m48s
  • Vector release with codegen-units = 1 and lto = "fat": 2m13s
  • Vector release with codegen-units = 1 and lto = "off" + PGO Instrumentation: 27m37s
  • Vector release with codegen-units = 1 and lto = "fat" + PGO Instrumentation: 28m08s
  • Vector release with codegen-units = 1 and lto = "thin" + PGO Instrumentation: 28m17s
  • Vector release with codegen-units = 1 and lto = "off" + PGO optimized: 2m19s
  • Vector release with codegen-units = 1 and lto = "off" + PGO optimized + BOLT instrumented: 23m59s
  • Vector release with codegen-units = 1 and lto = "off" + PGO optimized + BOLT optimized: 2m19s
  • Vector release with codegen-units = 1 and lto = "off" + BOLT instrumented: 24m30s
  • Vector release with codegen-units = 1 and lto = "off" + BOLT optimized: 2m48s
  • Vector release with codegen-units = 1 and lto = "fat" + BOLT instrumented: 18m10s
  • Vector release with codegen-units = 1 and lto = "fat" + BOLT optimized: 2m11s

According to the results above, there are several conclusions:

  • LTO is important to enable for Vector - it brings a measurable performance boost (fortunately, it's already enabled in the CI release builds).
  • The instrumentation modes for PGO and BOLT are very slow. If you are going to optimize Vector with PGO and/or BOLT, you need to keep this in mind.
  • At least in the tested case, LLVM BOLT does not bring a measurable improvement once PGO is used. There is an idea to try playing with BOLT options (see here, e.g. disabling lite mode), but for now I see no interesting benefits from BOLT for Vector. Also, please note that I used BOLT in instrumentation mode; there is a chance that BOLT via sampling works better - that needs to be tested (though rustc uses BOLT with instrumentation in its CI pipeline). One way to reproduce the BOLT builds is sketched after this list.
  • LTO + PGO can be buggy on the compiler side. Let's wait for clarification from the LLVM/Rust upstream regarding the issue. Since PGO shows improvements without LTO, I expect even better results from LTO + PGO compared to plain LTO (as already proven on macOS).
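
For reference, a sketch of one way to produce the BOLT-instrumented builds with the same cargo-pgo wrapper (assuming llvm-bolt and merge-fdata are installed; the exact instrumented binary name and location may vary):

# 1. build a BOLT-instrumented binary (add --with-pgo to stack BOLT on top of PGO)
cargo pgo bolt build

# 2. run it on the workload to collect a BOLT profile
./target/release/vector-bolt-instrumented -c vector.toml

# 3. rewrite the binary with the collected BOLT profile applied
cargo pgo bolt optimize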

@bruceg pinging you since you asked me regarding BOLT for Vector.

Thanks for this writeup, @zamazan4ik, that's great to see. Did lto = "fat" + PGO optimized work at all for you or did you hit a bug there? Is there an open issue regarding LTO + PGO in rustc?

Did lto = "fat" + PGO optimized work at all for you or did you hit a bug there? Is there an open issue regarding LTO + PGO in rustc?

Nope, it doesn't work right now due to a compilation error with the "LTO + PGO" combination. I've created an issue at Kobzol/cargo-pgo#32 and added a comment to a possibly related LLVM bug at llvm/llvm-project#57501 (comment) . I haven't created an issue about this behavior in the rustc repo yet (maybe @Kobzol can add some details regarding the issue). If not, I will create an issue in the rustc issue tracker as well.

Bug in the upstream regarding LTO + PGO: rust-lang/rust#115344