Profile-Guided Optimization (PGO) results
zamazan4ik opened this issue · 1 comment
Hi!
I am researching the benefits of Profile-Guided Optimization (PGO) on different software (results are here). I also optimized drill with PGO (via cargo-pgo) and want to share my results.
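For context, the build follows the usual cargo-pgo flow. Below is a rough sketch using cargo-pgo's documented subcommands; it is not necessarily the exact invocation used for these measurements:

```sh
# Sketch of the cargo-pgo workflow (illustrative, not the exact commands used here)
cargo install cargo-pgo         # one-time setup; also needs: rustup component add llvm-tools-preview
cargo pgo build                 # build an instrumented drill binary
# ... run the benchmark workload against the instrumented binary to collect profiles ...
cargo pgo optimize              # rebuild drill using the collected profiles
```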
Test environment
- Fedora 38
- Linux kernel 6.3.7
- AMD Ryzen 9 5900x
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Rustc: 1.70.0
- drill version: the latest main commit at the time of writing (dfd5548c8d4269d5fa8b73e81d616572e9a9d445)
Benchmark
As a benchmark, I used the server from example/server and drill invoked with drill --benchmark benchmark.yml --stats (the only change to benchmark.yml was the iteration count, increased to 10000). I compared Drill in Release mode vs. Drill in Release + PGO mode. The same workload was used as the profiling load (for collecting the PGO profile).
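For reproducibility, the setup looks roughly like the sketch below. The server entry point name is a placeholder, and the iterations key follows drill's bundled example benchmark.yml; adjust both as needed:

```sh
# Rough sketch of the benchmark setup; the server entry point name is a placeholder.
node example/server/server.js &            # start the example Node.js server (check the actual file name)
# In benchmark.yml, bump the iteration count, e.g.:
#   iterations: 10000
drill --benchmark benchmark.yml --stats    # run the load generator under measurement
```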
Results
Firstly, I want to highlight that the methodology is not ideal, since the CPU core is not saturated. I measured the "average" CPU load of drill on one core with the htop utility and checked it by eye during every run (yeah, some scripting over top could be used here, but right now I am quite lazy :). A rough sketch of such sampling is shown below. The lower the average CPU usage, the better. This method could be improved, but as a quick approach it should be good enough. All measurements were done on the same hardware/software, with the same "quiet" background load, multiple times, in different orders, etc.; the results are quite stable, at least on my machine.
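For anyone who wants to automate the measurement instead of eyeballing htop, something like the following would work. This is only a quick sketch using pidstat (from the sysstat package) and top in batch mode, not what was actually run for the numbers below:

```sh
# Average drill's CPU usage over 60 one-second samples (pidstat prints an "Average:" line at the end).
pidstat -p "$(pidof drill)" 1 60

# Or with plain top in batch mode, averaging the %CPU column (9th field by default):
top -b -d 1 -n 60 -p "$(pidof drill)" \
  | awk '/drill/ { sum += $9; n++ } END { if (n) printf "average CPU: %.1f%%\n", sum / n }'
```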
Here are the results for "Release", "Release + PGO", and "Instrumentation" mode (Instrumentation is included just for reference, so you can see how much slower Drill is in that mode):
- Release: average CPU load is ~9.0-9.7% (less frequently 10.3%)
- Release + PGO: average CPU load is ~7.8-8.4%
- Instrumentation: average CPU load is ~15.5%
At least in this test, I see an improvement in Drill's performance with PGO. If we can devise a setup where Drill itself is the CPU bottleneck in a near-real-life case, instead of the Node.js server, it would be great to test that as well.
These results could be important for anyone who wants to maximize benchmark-tool performance per core/CPU/machine: higher per-core efficiency postpones the moment when multiple machines need to be spawned to create the required stress load, and/or allows cheaper instances to create the same load.
Another example of optimizing a benchmark tool with PGO, from Goose, is here.