fcsonline/drill

Profile-Guided Optimization (PGO) results

zamazan4ik opened this issue · 1 comment

Hi!

I am doing research on the benefits of Profile-Guided Optimization (PGO) for different software (results are here). I optimized drill with PGO too (via cargo-pgo) and want to share my results.
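For context, the cargo-pgo workflow I used looks roughly like the sketch below (exact subcommands and the instrumented binary path may differ slightly depending on the cargo-pgo version and target triple):

```sh
# One-time setup: install cargo-pgo and the LLVM tools it relies on
cargo install cargo-pgo
rustup component add llvm-tools-preview

# 1. Build drill with PGO instrumentation
cargo pgo build

# 2. Run the instrumented binary against the training workload
#    (the same load as in the benchmark below) to collect profiles;
#    the exact path under target/ may differ on your system
./target/x86_64-unknown-linux-gnu/release/drill --benchmark benchmark.yml --stats

# 3. Rebuild drill with the collected profiles applied
cargo pgo optimize
```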

Test environment

  • Fedora 38
  • Linux kernel 6.3.7
  • AMD Ryzen 9 5900x
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Rustc: 1.70.0
  • drill version: the latest main commit at the time of writing (dfd5548c8d4269d5fa8b73e81d616572e9a9d445)

Benchmark

As a benchmark, I used the server from example/server and ran drill with drill --benchmark benchmark.yml --stats (the only change to benchmark.yml was the iteration count, increased to 10000). I compared drill built in Release mode vs Release + PGO. The same workload was used as the profiling load to collect the profile.
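For reproducibility, the comparison setup looks roughly like this sketch (the server start command and the binary names are illustrative, not the exact commands I ran):

```sh
# Start the example server from the repository (a NodeJS server; the exact
# entry point in example/server may differ, shown here only as an illustration)
cd example/server && node server.js &

# Run the regular Release build and the PGO-optimized build against the same
# benchmark.yml (iterations raised to 10000), observing drill's CPU usage
./drill-release --benchmark benchmark.yml --stats
./drill-pgo --benchmark benchmark.yml --stats
```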

Results

Firstly, I want to highlight that the methodology is not ideal, since the CPU core is not saturated. I measured the "average" CPU load of drill on one core with the htop utility and checked it by eye during every run (yeah, some scripting over top could be used here, but right now I am quite lazy :). The lower the average CPU usage, the better. This method could be improved, but as a quick check it should be good enough. All measurements were done on the same hardware/software, with the same "quiet" background load, multiple times, in different orders, etc. - they are quite stable, at least on my machine.
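If someone wants to automate that eyeballing, a minimal sampling sketch (assuming a single process named drill and standard Linux procps tools) could look like this:

```sh
# Sample drill's CPU usage once per second while it is running,
# then print the average of the collected samples.
rm -f cpu_samples.txt
while pgrep -x drill > /dev/null; do
  ps -C drill -o %cpu= >> cpu_samples.txt
  sleep 1
done
awk '{ sum += $1; n++ } END { if (n) printf "average CPU: %.2f%%\n", sum / n }' cpu_samples.txt
```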

Here are the results for "Release", "Release with PGO", and "Instrumentation" modes (Instrumentation is included just for the record, so you can estimate how slow drill is in Instrumentation mode):

  • Release: average CPU load is ~9.0 - 9.7% (less frequently 10.3%)
  • Release + PGO: average CPU load is ~7.8 - 8.4%
  • Instrumentation: average CPU load is ~15.5%

At least in this test, I see an improvement in drill's performance with PGO. If we can find a setup where drill itself is the CPU bottleneck in a "near real-life" case, instead of the NodeJS server, it would be great to test that as well.

These results could be important for people who want to maximize benchmark-tool performance per core/CPU/machine, since better per-core efficiency can postpone the moment when multiple machines are needed to generate the required stress load, and/or allow cheaper instances to generate the same load.

Another example of optimizing a benchmark tool (Goose) with PGO is here.