facebook/rocksdb

RocksDBStore.openDB performance degradation after version 6.27.3 on Mac

conan-goldsmith opened this issue · 7 comments


Expected behavior

The openDB method is expected to have consistent timings across versions for all platforms.

Actual behavior

When comparing performance between 6.27.3 and 6.29.4.1, operation time consistently increases by roughly 6x when running on a Mac.
Testing shows this behavior is isolated to Macs; Windows and Linux hosts do not observe the increase in operation time.

Steps to reproduce the behavior

This was reproduced using Kafka Streams, by measuring the time simple tests take to compute and operate against RocksDB.
It was initially identified when upgrading Kafka Streams from version 3.1.2 to 3.3.1 (which bundles the newer RocksDB); the reproduction is based on switching between these two versions, and the output below references them.

With a simple KStreams topology that produces random data, or performs any stateful operation, RocksDB times increase dramatically under the newer RocksDB version, due to the openDB method.
For example, here are some KStreams applications for testing; the only change needed between tests is altering the version used in Maven to recreate the behavior: https://docs.confluent.io/platform/current/streams/code-examples.html

In our testing, runs that take a few seconds on the old version take 20+ seconds on the new one. This was reproduced on several Mac, Linux, and Windows machines. Here are some example times recorded with the simple KStreams app we were using as a baseline.

Kafka | OS | Arch | Java | Duration (s)
------|----|------|------|-------------
3.1.2 | Mac OS X 10.16 | x86_64 | 1.8.0_311 | 3
3.3.1 | Mac OS X 10.16 | x86_64 | 1.8.0_311 | 22
3.1.2 | Mac OS X 12.6 | x86_64 | 11.0.16.1 | 4
3.3.1 | Mac OS X 12.6 | x86_64 | 11.0.16.1 | 27
3.1.2 | Windows 10 10.0 | amd64 | 16.0.2 | 5
3.3.1 | Windows 10 10.0 | amd64 | 16.0.2 | 5
3.1.2 | Linux 5.10.102.1-microsoft-standard-WSL2 | amd64 | 1.8.0_342 | 7
3.3.1 | Linux 5.10.102.1-microsoft-standard-WSL2 | amd64 | 1.8.0_342 | 8
3.3.1 | Mac OS X 10.16 | x86_64 | 1.8.0_231 | 21
3.3.1 | Mac OS X 10.16 | x86_64 | 1.8.0_231 | 22
3.3.1 | Mac OS X 10.16 | x86_64 | 1.8.0_231 | 22
3.3.1 | Mac OS X 10.16 | x86_64 | 1.8.0_231 | 21
3.3.1 | Mac OS X 10.16 | x86_64 | 1.8.0_231 | 21
3.1.2 | Mac OS X 10.16 | x86_64 | 1.8.0_231 | 1
3.1.2 | Mac OS X 10.16 | x86_64 | 1.8.0_231 | 14
3.1.2 | Mac OS X 10.16 | x86_64 | 1.8.0_231 | 1
3.1.2 | Mac OS X 10.16 | x86_64 | 1.8.0_231 | 2

Please find attached code that can be used to reproduce the issue. The actual topology does not matter; the only thing that matters is the version of the Kafka libraries (and therefore RocksDB).

code-to-reproduce.zip

The issue was reproduced on JDK 8 and 11 (Kafka doesn't support 17 yet). The code can be built for the two different Kafka versions at the same time using the command:

mvn clean; mvn -Drevision=3.3.1 package; mvn -Drevision=3.1.2 package

and then run with:

java -jar target/slow-test-demo-3.1.2.jar
java -jar target/slow-test-demo-3.3.1.jar

Please note that it's less of an issue on M1 MacBooks, but it can clearly be seen on Intel-based ones.

@conan-goldsmith Is this just an issue with open performance, or does it show up at other times as well? I have noticed a performance degradation on the Mac related to the use of F_FULLFSYNC. Would it be possible for you to build a version with F_FULLFSYNC disabled (you will need to edit the Make/CMake build) and see how it performs?
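For reference, the macOS-specific sync is gated on a build-time macro, so disabling it means building without that define rather than changing the code. A paraphrased sketch of the gated path (names and error handling here are illustrative, not copied verbatim from any particular RocksDB version):

```cpp
// Paraphrased sketch of the platform-gated sync path added for macOS.
// HAVE_FULLFSYNC comes from the Make/CMake platform detection; building
// without it falls back to the ordinary POSIX calls.
#include <fcntl.h>
#include <unistd.h>

int sync_fd(int fd) {
#ifdef HAVE_FULLFSYNC
  // macOS: force the write all the way through the drive cache.
  return ::fcntl(fd, F_FULLFSYNC);
#else
  // Elsewhere (e.g. Linux): flush the file's data to the device.
  return ::fdatasync(fd);
#endif
}
```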

Hi @conan-goldsmith @mrambacher, I have had a look at this at the suggestion of @adamretter.

M1 support doesn't go back far enough to test the first version, but I've verified the performance change on an Intel Mac, and VisualVM profiling points clearly at org.rocksdb.RocksDB.open.

I then rebuilt both RocksJava and Kafka Streams from current main (on my M1 MBP):

  1. The problem is still present.
  2. Turning off F_FULLFSYNC (the change introduced by #9356, which fits the timescale between the two RocksDB versions) fixes it.

There are 80 calls to open() in the test run, and they range as follows (64GB/2TB M1 MacBook Pro):
F_FULLFSYNC: 0.12s --> 0.18s
No F_FULLFSYNC: 0.002s --> 0.004s
To be precise, that's measured by calling std::chrono::system_clock::now() before and after rocksdb_open_helper() in Java_org_rocksdb_RocksDB_open.
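For anyone who wants to repeat the measurement without going through the JNI layer, here is a standalone sketch that times DB::Open the same way (the path and loop count are illustrative):

```cpp
// Standalone sketch mirroring the measurement: wrap DB::Open in
// system_clock::now() calls, just as was done around rocksdb_open_helper()
// in the JNI layer.
#include <chrono>
#include <iostream>
#include <string>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  for (int i = 0; i < 80; ++i) {  // the test run made ~80 open() calls
    rocksdb::DB* db = nullptr;
    auto start = std::chrono::system_clock::now();
    rocksdb::Status s = rocksdb::DB::Open(
        options, "/tmp/open-timing-" + std::to_string(i), &db);
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "open " << i << ": " << elapsed.count() << "s "
              << s.ToString() << std::endl;
    delete db;
  }
  return 0;
}
```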

That's a brutal change. Each open() call must indeed be waiting for a physical write to disk. I note that golang saw the same issues: golang/go#27415

Is this a problem that only occurs on Mac? How can the required guarantees exist on Linux without the performance impact?

I presume @conan-goldsmith that a workaround for Kafka would involve concurrent threads making the open() calls. It would be for @mrambacher and the core team to consider whether any remedial changes are feasible. Are all current uses of F_FULLFSYNC necessary?

@conan-goldsmith I do not know if F_FULLFSYNC is required. Perhaps F_BARRIERFSYNC would be a better choice. Reading the documentation, F_BARRIERFSYNC appears as if it should provide sufficient guarantees -- do you agree?
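To make the difference concrete, here is a minimal sketch of the three sync strengths under discussion on macOS (the enum and helper function are illustrative, not RocksDB API):

```cpp
// Minimal macOS sketch of the three durability levels. F_FULLFSYNC and
// F_BARRIERFSYNC are both fcntl(2) commands available on macOS.
#include <fcntl.h>
#include <unistd.h>

enum class SyncLevel { kFsync, kBarrier, kFull };

int sync_with_level(int fd, SyncLevel level) {
  switch (level) {
    case SyncLevel::kFull:
      // Flush the file and ask the drive to drain its cache: durable, slow.
      return ::fcntl(fd, F_FULLFSYNC);
    case SyncLevel::kBarrier:
      // Impose write ordering without draining the drive cache: much cheaper.
      return ::fcntl(fd, F_BARRIERFSYNC);
    case SyncLevel::kFsync:
    default:
      // Plain fsync(): on macOS this may stop at the drive's cache.
      return ::fsync(fd);
  }
}
```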

This issue was introduced by @riversand963 in #9356. It makes the tests on the Mac take an extremely long time (the tests do more syncs than necessary) and causes other performance issues, as you noted.

There's more in https://news.ycombinator.com/item?id=30372218.
What I infer (and the naming makes sense) is that:

  1. Linux fsync() flushes the specific file to disk
  2. macOS fcntl(F_FULLFSYNC) flushes the buffer cache (so all files), like Linux syncfs()
  3. macOS fcntl(F_BARRIERFSYNC) imposes a write-ordering guarantee but doesn't guarantee that a write has been flushed to disk
  4. so (3) doesn't provide the strong guarantee that (1) does

The precise semantic guarantees promised by RocksDB and those expected by Kafka are not things I know, but maybe in some cases (e.g. open()) there are multiple flushes where only one needs to happen to guarantee the results are on disk. Multiple F_BARRIERFSYNCs followed by a single F_FULLFSYNC would work for such a situation, as sketched below.
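A sketch of that pattern, assuming the batch of files is known up front (the helper is illustrative; RocksDB has no such function):

```cpp
// Order each intermediate write with a cheap barrier, then make the whole
// batch durable with one full flush at the end. Everything here is
// illustrative of the proposed pattern, not existing RocksDB code.
#include <fcntl.h>
#include <cstddef>
#include <vector>

bool sync_batch(const std::vector<int>& fds) {
  if (fds.empty()) return true;
  // Barriers guarantee these writes reach the media before anything issued
  // after them, without waiting for the drive cache to drain each time.
  for (std::size_t i = 0; i + 1 < fds.size(); ++i) {
    if (::fcntl(fds[i], F_BARRIERFSYNC) < 0) return false;
  }
  // One F_FULLFSYNC drains the drive cache, making the ordered writes
  // above durable as well.
  return ::fcntl(fds.back(), F_FULLFSYNC) == 0;
}
```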

I ran four sets of "timed make check" tests on my M1 machine:
FULLSYNC: make check 4333.14s user 1929.06s system 66% cpu 2:37:10.46 total
BARRIER: make check 3254.61s user 1179.75s system 150% cpu 49:06.36 total
MIXED: make check 3404.71s user 1430.92s system 109% cpu 1:13:19.27 total
OFF: make check 3098.69s user 883.61s system 269% cpu 24:39.01 total

The FULLSYNC and BARRIER results used F_FULLFSYNC and F_BARRIERFSYNC, respectively, for all sync and fsync requests. OFF disabled the feature altogether (defaulting to fdatasync and fsync). MIXED used F_BARRIERFSYNC for sync operations (where fdatasync would be used on Linux) and F_FULLFSYNC for fsync operations.
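In code terms, the MIXED mapping is roughly this (function names are mine, not RocksDB's actual interface):

```cpp
// Sketch of the MIXED policy: RocksDB's weaker "sync" (fdatasync on Linux)
// maps to an ordering barrier, its stronger "fsync" to a full write-through.
#include <fcntl.h>

// Where Linux would call fdatasync(fd): cheap ordering barrier.
int mixed_sync(int fd) { return ::fcntl(fd, F_BARRIERFSYNC); }

// Where Linux would call fsync(fd): full flush through the drive cache.
int mixed_fsync(int fd) { return ::fcntl(fd, F_FULLFSYNC); }
```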

I do not know which mode is required for the proper semantic guarantees, but F_FULLFSYNC seems like a big hammer.