apple/foundationdb

Should support profile-guided optimization

kaomakino opened this issue · 6 comments

We should have a mechanism to build FDB with profile-guided optimization (PGO). Our preliminary benchmark results showed a 4-12% performance improvement, depending on the workload.

GET RANGE with the range of 10 keys: 3.71%
GET RANGE with the range of 50 keys: 12.32%
GET and SET on the same key: 9.62%
SET a new unique key: 6.08%
Mix of 8 GETs, 1 GET & SET, 1 SET: 4.17%

Selecting the most effective instrumentation workload should be discussed separately.
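
For context, the usual Clang PGO workflow is a build-run-rebuild cycle. The sketch below is only illustrative; the exact flags and the CMake integration for the FDB build would need to be worked out, and the file names are placeholders:

# 1. Build with instrumentation
clang++ -O3 -fprofile-generate -o fdbserver_instrumented ...

# 2. Run the instrumentation workload; the instrumented binary writes a raw profile
#    (default.profraw, or the path given in LLVM_PROFILE_FILE)
LLVM_PROFILE_FILE=fdb-%p.profraw ./fdbserver_instrumented ...

# 3. Convert the raw profile and rebuild with it applied
llvm-profdata merge -output=fdb.profdata fdb-*.profraw
clang++ -O3 -fprofile-use=fdb.profdata -o fdbserver ...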

When we used PGO in conjunction with LTO, the gains were more significant.

GET RANGE with the range of 10 keys: 5.55% (was 3.71% w/o LTO)
GET RANGE with the range of 50 keys: 18.43% (12.32%)
GET and SET on the same key: 10.56% (9.62%)
SET a new unique key: 7.05% (6.08%)
Mix of 8 GETs, 1 GET & SET, 1 SET: 6.07% (4.17%)
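
For what it's worth, combining the two in a Clang build should mostly be a matter of adding -flto (or -flto=thin) to both the instrumented and the profile-use builds, plus an LTO-capable linker such as lld; a rough sketch:

clang++ -O3 -flto=thin -fprofile-generate ...                  # instrumented build
clang++ -O3 -flto=thin -fprofile-use=fdb.profdata ...          # final optimized build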

Is the above performance gain the best we can get from PGO and LTO?

I was unaware of the existence of https://llvm.org/docs/CommandGuide/llvm-profdata.html, which means that for whatever benchmark we end up using, we can run an ssd and a memory config (or others, to make sure our code actually gets exercised), and then combine the profiles before feeding them back into the compiler. Was your profile generated from an everything-in-one-process run? Simulation?

I do agree that deciding what workload to use to generate the profile is a potentially difficult question in and of itself.
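
If we do go with multiple configs, the merge step is a single llvm-profdata invocation; the raw-profile file names below are hypothetical:

llvm-profdata merge -output=combined.profdata ssd_run.profraw memory_run.profraw
clang++ -O3 -fprofile-use=combined.profdata ...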

We don't know whether those numbers are the best we can get, but they are probably not far off, because we used the same benchmark as the instrumentation workload for PGO. We used our standalone fdb_c binding benchmark against an everything-in-one-process fdbserver (not simulation).

While PGO is definitely an interesting optimization option to explore, I want to point out that, based on the perf gain shown in the preliminary tests above, it might not be enough to justify the effort, given the following (quoted from the Clang user's manual):

Code that is not exercised in the profile will be optimized as if it is unimportant, and the compiler may make poor optimization choices for code that is disproportionately used while profiling.

I thought PGO was something like the cost-based optimization used in a database planner, which accelerates certain queries and treats others as normal based on statistics, but it sounds like PGO will actually punish code that was treated as unimportant based on the workload used during profiling. This makes me feel that unless we see a really huge perf gain, probably at least 50%, we should be really, really careful.

If code is not exercised, or was never instrumented, the compiler should apply the same optimizations as if PGO weren't used. What we need to be careful about is when the instrumentation does execute a code path, but not in a representative way. For example, if there's an if statement like this:

if (error) {
  errorHandling();    // taken only when something goes wrong
} else {
  normalExecution();  // hot path in production
}

If the instrumentation workload takes the error handling path, then PGO will favor the error handling path, and the normal execution path will be penalized by an unnecessary branch miss.
That is why we need a careful discussion about the workload we use for the instrumentation.
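
As a sanity check on whichever workload we pick, the collected counters can be inspected before they are fed back into the compiler, so we can see whether error paths dominated the profile; a sketch, assuming the merged profile is named fdb.profdata:

llvm-profdata show -all-functions -counts fdb.profdata
llvm-profdata show -topn=20 fdb.profdata    # hottest functions by count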

By the way, 50% from PGO is a little unrealistic. 10-15% is pretty good, in my opinion, when the baseline is already built with -O3.