Should support profile-guided optimization
kaomakino opened this issue · 9 comments
We should have a mechanism to build FDB with profile-guided optimization (PGO). Our preliminary benchmark results showed a 4-12% performance improvement, depending on the workload.
GET RANGE with the range of 10 keys: 3.71%
GET RANGE with the range of 50 keys: 12.32%
GET and SET on the same key: 9.62%
SET a new unique key: 6.08%
Mix of 8 GETs, 1 GET & SET, 1 SET: 4.17%
Selecting the most effective instrumentation workload should be discussed separately.
When we used PGO in conjunction with LTO, the gains were more significant.
GET RANGE with the range of 10 keys: 5.55% (was 3.71% w/o LTO)
GET RANGE with the range of 50 keys: 18.43% (12.32%)
GET and SET on the same key: 10.56% (9.62%)
SET a new unique key: 7.05% (6.08%)
Mix of 8 GETs, 1 GET & SET, 1 SET: 6.07% (4.17%)
Are the gains above the best we can expect from combining PGO and LTO?
I was unaware of the existence of https://llvm.org/docs/CommandGuide/llvm-profdata.html, which means for whatever benchmark we end up using, we can run an ssd and a memory config (or others, to make sure our code is actually run), and then combine them before feeding it back into the compiler. Was your profile generated from an everything-in-one-process run? Simulation?
I do agree that deciding what workload to use to generate the profile is a potentially difficult question in and of itself.
We don't know whether those numbers are the best we can get, but they are probably not far off, because we used the same benchmark as the instrumentation workload for PGO. We used our standalone fdb_c binding benchmark against an everything-in-one-process fdbserver. (Not simulation)
While PGO is definitely an interesting optimization option to explore, I want to point out that, based on the perf gain shown in the preliminary tests above, it might not be enough to justify the effort, because of the following (quoted from the Clang user's manual):
Code that is not exercised in the profile will be optimized as if it is unimportant, and the compiler may make poor optimization choices for code that is disproportionately used while profiling.
I thought PGO was something like the cost-based optimization used in a database's query planner, which accelerates certain queries based on statistics and treats the others as normal ones. But it sounds like PGO will actually punish code that was treated as unimportant based on the workload used during profiling. This makes me feel that unless we see a really large perf gain, probably at least 50%, we should be very careful.
If code is not exercised, or was never instrumented, the compiler should apply the same optimizations as if PGO were not used. What we need to be careful about is when the instrumentation workload does execute a code path, but not in a representative way. For example, if there's an if statement like this:
if (error) {
    errorHandling;
} else {
    normalExecution;
}
If the instrumentation workload takes the error handling path, then PGO will favor the error handling path, and the normal execution path will be penalized by an unnecessary branch miss.
That is why we need a careful discussion about the workload we use for the instrumentation.
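To make the concern concrete, here is a minimal sketch (not FDB code) that writes by hand the kind of branch weighting a profile would otherwise supply. It assumes GCC or Clang, since `__builtin_expect` is a compiler builtin rather than standard C++, and the names `errorHandling`, `normalExecution`, `processRepresentative` and `processSkewed` are hypothetical stand-ins for the snippet above. The hint is only advice to the optimizer about which path to treat as the common one; how much it changes the generated code depends on the compiler and flags.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two paths in the if statement above.
static int errorHandling() { return -1; }
static int normalExecution() { return 1; }

// Weighted the way a representative profile would weight it: errors are rare,
// so the compiler is told to treat the else branch as the common path.
int processRepresentative(bool error) {
    if (__builtin_expect(error, 0)) {
        return errorHandling();
    } else {
        return normalExecution();
    }
}

// Weighted the way a profile dominated by failures would weight it: the
// compiler is told the error branch is the common path, which is the
// situation described above for a poorly chosen instrumentation workload.
int processSkewed(bool error) {
    if (__builtin_expect(error, 1)) {
        return errorHandling();
    } else {
        return normalExecution();
    }
}

// The hint never changes results, only which path is assumed to be hot.
void checkSameBehavior() {
    assert(processRepresentative(false) == processSkewed(false));
    assert(processRepresentative(true) == processSkewed(true));
}
```

Both functions return the same values for the same input. Compiling them with something like `-O2 -S` and comparing the assembly is one way to see whether the two hints lead to a different block layout on a given compiler.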
By the way, 50% from PGO is a little too unrealistic. 10-15% is pretty good, in my opinion, when the baseline build was built with -O3.
Could anyone clarify the current status of PGO on FoundationDB? According to the results from many other projects (including databases like PostgreSQL, Redis, MongoDB, ClickHouse), PGO helps a lot with achieving better performance.
If we are not ready right now to integrate PGO into the build process, can we at least write a note about PGO in the FoundationDB documentation? That way, users and maintainers will know about an additional way to achieve better performance with FDB. Here are examples of such documentation in other projects:
- GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- Clang:
- ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
- Databend: https://databend.rs/doc/contributing/pgo
- Vector: https://vector.dev/docs/administration/tuning/pgo/
- Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
As a further idea, I can suggest testing LLVM BOLT as an additional post-PGO optimization step. More materials about PGO, BOLT, and other related topics can be found at https://github.com/zamazan4ik/awesome-pgo.
Friendly pinging @kaomakino (as a TS), and @jzhou77 @xis19 @kakaiu as active FDB contributors.
FDO is supported for Clang builds.
-DPROFILE_INSTR_GENERATE=on CMake option enables the instrumentation flag, then you can build the generate_profile target that builds the instrumentation build and runs a given workload to generate profile data in the fdbmonitor build phase.
(You may want to modify the profiling workload in contrib/generate_profile.sh)
Then, -DPROFILE_INSTR_USE=<profile> will use the profile data generated above to build the final binaries with FDO.
We have also evaluated BOLT with gcc (by passing -Wl,--emit-relocs), but Clang's FDO provided better performance, so we did not proceed with BOLT back then.
Great! Did you measure performance improvements from this on FoundationDB? If yes, could you please share the results? Are the results the same as in the starting post in this issue?
Also, it would be great if you added information about building FoundationDB with FDO to the documentation. That way, users and/or maintainers will know about an additional way to optimize FDB performance. Here are some examples:
- ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
- Databend: https://databend.rs/doc/contributing/pgo
- Vector: https://vector.dev/docs/administration/tuning/pgo/
- Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
- GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- Clang:
Did you test BOLT in addition to FDO (that is, applying BOLT to a binary already optimized with FDO)? According to my tests with YDB (ydb-platform/ydb#140 (comment)) and the Rustc results, it helps (Rustc is already optimized with FDO + BOLT on the Linux platform). Did you test BOLT after FDO on the Clang build?
Are the FDB binaries provided here optimized with FDO or not?