facebook/rocksdb

There are too many write stalls because the write slowdown stage is frequently skipped

Closed this issue · 17 comments

Expected behavior

Write slowdowns should be applied long before write stalls

Actual behavior

The bytes pending compaction value computed in EstimateCompactionBytesNeeded changes too rapidly. With write-heavy workloads (db_bench --benchmarks=overwrite) RocksDB can go, within 1 or 2 seconds, from no stalls (or some slowdown) to long periods of write stalls. Even worse are write stalls that last for 100+ seconds. With a small change (see PoC diff below) there are no long stalls and variance is greatly reduced.

The root cause is that CalculateBaseBytes computes sizes bottom up for all levels except Lmax. It uses max( sizeof(L0), options.max_bytes_for_level_base) for the L1 target size, and then from L1 to Lmax-1, the target size is set to be fanout times larger.

This leads to large targets when L0 is large because compaction gets behind. On its own that isn't horrible and might even be considered adaptive. However, when sizeof(L0) suddenly shrinks because an L0->L1 compaction finishes, the targets can suddenly become much smaller because the L1 target becomes max_bytes_for_level_base rather than sizeof(L0). At that point the bytes pending compaction value can become huge because it is the sum of the per-level bytes pending compaction values (PLBPC), where PLBPC = (sizeof(level) - target_size(level)) * fanout.
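
For illustration, here is a minimal sketch of that estimate as described above. It is not the actual EstimateCompactionBytesNeeded code (which differs in details); level_size, target_size and fanout are stand-ins for values RocksDB tracks internally.

#include <cstdint>
#include <vector>

// Simplified model of the estimate described above; NOT the real
// EstimateCompactionBytesNeeded implementation.
uint64_t EstimatePendingCompactionBytes(const std::vector<uint64_t>& level_size,
                                        const std::vector<uint64_t>& target_size,
                                        double fanout) {
  uint64_t total = 0;
  for (size_t level = 0; level < level_size.size(); ++level) {
    if (level_size[level] > target_size[level]) {
      // PLBPC = (sizeof(level) - target_size(level)) * fanout
      total += static_cast<uint64_t>(
          (level_size[level] - target_size[level]) * fanout);
    }
  }
  // The total is compared against soft_pending_compaction_bytes_limit
  // (slowdown) and hard_pending_compaction_bytes_limit (stop).
  return total;
}

With this shape it is clear why a sudden drop in every target_size[level] makes the total jump even though no data has moved.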

We also need to improve visibility for this. AFAIK, it is hard or impossible to see the global bytes pending compaction, the contribution from each level and the target size for each level. I ended up adding a lot of code to dump more info to LOG.

Here is one example using things written to LOG, including extra monitoring I added. The first block of logging is at 20:23:49 and there are write slowdowns because bytes pending compaction is ~25B while the slowdown trigger was ~23B. The second line lists the target size per level: 1956545174 (~2B) for L1, 5354149941 (~5.3B) for L2, 14651806650 (~14.6B) for L3, 40095148713 (~40B) for L4, 109721687484 (~110B) for L5. In this case the value for bytes pending compaction jumps from ~25B to ~85B in 2 seconds. In other tests I have seen it increase by 100B or 200B in a few seconds because an L0->L1 compaction completes.

2022/01/21-20:23:49.546718 7ff4d71ff700 [/version_set.cc:2687] ZZZ ECBN scores 24985540821 total, 25.0 totalB, 2.0 L0, 15.6 L1, 9.3 L2, 0.0 L3, 0.0 L4, 0.0 L5, 0.0 L6, 0.0 L7, 0.0 L8, 0.0 L9
2022/01/21-20:23:49.546733 7ff4d71ff700 [/version_set.cc:2703] ZZZ ECBN target 12 L0, 1956545174 L1, 5354149941 L2, 14651806650 L3, 40095148713 L4, 109721687484 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:49.546737 7ff4d71ff700 [/version_set.cc:2716] ZZZ ECBN actual 1956545174 L0, 6988421099 L1, 10062248176 L2, 14614385201 L3, 4683299263 L4, 37515691925 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:49.546773 7ff4d71ff700 [WARN] [/column_family.cc:984] [default] ZZZ Stalling writes because of estimated pending compaction bytes 24985540821 rate 5368703

But 2 seconds later, after L0->L1 finishes, the per-level target sizes are recomputed and are much smaller: 141300841 (~140M) for L1, 654055230 (~650M) for L2, 3027499631 (~3B) for L3, 14013730948 (~14B) for L4, 64866946001 (~65B) for L5. The result is that the per-level contributions to bytes pending compaction become much larger for L2, L3 and L4, which triggers a long write stall. The ECBN scores line has the value of estimated_compaction_bytes_needed_ as total and totalB, followed by the per-level contributions. The ECBN actual line has the actual per-level sizes. These show that the increase comes from the sudden reduction in the per-level target sizes.

2022/01/21-20:23:51.893239 7ff4db3ff700 [/version_set.cc:3859] ZZZ set level_max_bytes: 141300841 L1, 654055230 L2, 3027499631 L3, 14013730948 L4, 64866946001 L5, 300256990742 L6, 9 L7, 9 L8, 9 L9
2022/01/21-20:23:51.906933 7ff4db3ff700 [/version_set.cc:2687] ZZZ ECBN scores 84473373628 total, 84.5 totalB, 0.1 L0, 18.7 L1, 20.5 L2, 22.1 L3, 23.1 L4, 0.0 L5, 0.0 L6, 0.0 L7, 0.0 L8, 0.0 L9
2022/01/21-20:23:51.906951 7ff4db3ff700 [/version_set.cc:2703] ZZZ ECBN target 3 L0, 141300841 L1, 654055230 L2, 3027499631 L3, 14013730948 L4, 64866946001 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:51.906954 7ff4db3ff700 [/version_set.cc:2716] ZZZ ECBN actual 141300841 L0, 6955042332 L1, 11844113742 L2, 21096345478 L3, 22752145110 L4, 46254106087 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:51.907149 7ff4db3ff700 [WARN] [/column_family.cc:925] [default] ZZZ Stopping writes because of estimated pending compaction bytes 84473373628

Steps to reproduce the behavior

A proof-of-concept diff to fix this is here. It computes the per-level targets from largest to smallest, while the current code computes them from smallest to largest.
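
For context, a rough sketch of the two directions, roughly as described in this issue; this is not the PoC diff or the actual CalculateBaseBytes code, and base_bytes, fanout and num_levels are simplified stand-ins.

#include <algorithm>
#include <cstdint>
#include <vector>

// Current direction as described above: start from
// max(sizeof(L0), max_bytes_for_level_base) and multiply upward, so a
// shrinking L0 shrinks every target at once.
std::vector<uint64_t> TargetsSmallestToLargest(uint64_t l0_size,
                                               uint64_t base_bytes,
                                               double fanout, int num_levels) {
  std::vector<uint64_t> target(num_levels, 0);
  target[1] = std::max(l0_size, base_bytes);
  for (int level = 2; level < num_levels; ++level) {
    target[level] = static_cast<uint64_t>(target[level - 1] * fanout);
  }
  return target;
}

// Direction of the PoC as described above: derive targets from the actual
// size of the largest level, dividing by fanout going down, so targets do
// not swing with the size of L0.
std::vector<uint64_t> TargetsLargestToSmallest(uint64_t lmax_size,
                                               uint64_t base_bytes,
                                               double fanout, int num_levels) {
  std::vector<uint64_t> target(num_levels, 0);
  target[num_levels - 1] = lmax_size;
  for (int level = num_levels - 2; level >= 1; --level) {
    target[level] = std::max(
        base_bytes, static_cast<uint64_t>(target[level + 1] / fanout));
  }
  return target;
}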

Reproduction:

numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=5 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=24696061952 --hard_pending_compaction_bytes_limit=49392123904 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1642906118

numactl --interleave=all ./db_bench --benchmarks=overwrite --use_existing_db=1 --sync=0 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=5 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=24696061952 --hard_pending_compaction_bytes_limit=49392123904 --duration=3600 --threads=16 --merge_operator="put" --seed=1642907605 --report_file=bm.lc.nt16.cm1.d0/v6.26.1/benchmark_overwrite.t16.s0.log.r.csv

Results are for 2 workloads (CPU-bound, IO-bound), each repeated with 3 configs, comparing v6.26.1 without and with the PoC fix listed above. The CPU-bound workload uses the command line above except that the number of keys is 40M. The IO-bound workload uses the command line above (800M keys). Both were run on a large server with a fast SSD, 64G of RAM and 36 HW threads.

all use:

--level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30

x1 uses:

--soft_pending_compaction_bytes_limit=24696061952 --hard_pending_compaction_bytes_limit=49392123904

x2 uses:

--soft_pending_compaction_bytes_limit=49392123904 --hard_pending_compaction_bytes_limit=98784247808

x3 uses:

--soft_pending_compaction_bytes_limit=98784247808 --hard_pending_compaction_bytes_limit=197568495616

Below, old is upstream without changes and new has the PoC diff. The diff fixes the stalls with a small benefit to average throughput. The worst case stall without the fix is between 218 and 319 seconds for the IO-bound workloads.

--- cpu-bound
qps     qps       max stall usecs
old     new       old     new
183449  261900    5651332 22779
170995  261053    161738  26681
176169  260813    133591  98030

--- IO-bound
qps     qps       max stall usecs
old     new       old     new
104364  110613    239833382 98609
109937  113630    318710946 124163
109735  114530    218813592 99256

The graphs have the per-interval write-rate using 5-second intervals. The red line is RocksDB v6.26.1 with the PoC fix, the blue line is RocksDB v6.26.1 as-is.

For the x1 config and IO-bound workload
x1 io

For the x1 config and CPU-bound workload
x1 mem

For the x2 config and IO-bound workload
x2 io

For the x2 config and CPU-bound workload
x2 mem

For the x4 config and IO-bound workload
x4 io

For the x4 config and CPU-bound workload
x4 mem

ajkr commented

Siying, Jay, and I discussed this a bit. I am probably the one slowing down the response to this issue due to being caught up in the philosophy of slowing down writes. Here is a quick update on where I am on that right now.

Take, for example, an arrival rate curve that would come close to violating limits on space, memory, or read-amp, but not quite reach those limits. The gradual slowdown phase is harmful to such workloads because it causes them to sleep while not protecting any limit. Worse, when arrival rate exceeds processing rate in the slowdown phase (as is likely to happen, otherwise we wouldn't be slowing down), the requests queue up and each one experiences progressively higher latencies, not just the latency of one sleep.

The graphs look much more stable with the proposed change. But does it mean RocksDB is more stable? I am not sure because each curve in the graphs shows a different arrival rate. I think it is important to fix the arrival rate (note: db_bench cannot do this) and look at the tail latencies. For example take this graph:

[one of the per-interval write-rate graphs above]

Could the RocksDB that produced the red curve handle an arrival rate resembling the blue curve with decent tail latencies? Or vice versa? I am not sure. All I know is the blue RocksDB could handle the blue curve smoothly, and the red one could handle the red curve smoothly.

One thing Siying mentioned today that stuck with me is maybe there are some systems that use RocksDB like a closed-loop benchmark (I suppose they have some control over arrival rate, e.g., by shedding excess requests to somewhere else). For those it indeed would be nice if we held a stable processing rate in face of pressure on the system. I need to think about this case further, and how it'd be done to not regress systems that have no control over arrival rate.

Some brief notes on past experiences:

  • In Pebble we only had hard limits (discussion: cockroachdb/pebble#164 (comment)) in response to limit violations (space, memory, read-amp).
  • Working with users, I have yet to hear user-visible problems that would be solvable by slowing down writes earlier. Domas wanted us to never stall.

WRT gradual slowdowns being harmful, users can set the soft and hard limits to be close or far apart, or we could make RocksDB more clever about when to apply the soft limit.

The average throughput from the red curve was slightly higher than the blue curve. So I am confused by this.

Could the RocksDB that produced the red curve handle an arrival rate resembling the blue curve with decent tail latencies?

The red and blue curves would be largely unchanged for open-loop workloads if plotted from the perspective of RocksDB. These were plotted from the client's perspective (client == db_bench) because it is easier to get that info from the client.

One achievable goal for a DBMS is to provide steady throughput from the DBMS perspective during overload, whether from open-loop or closed-loop workloads. RocksDB can't do that today and I think it should.

Working with users, I have yet to hear user-visible problems that would be solvable by slowing down writes earlier. Domas wanted us to never stall.

But RocksDB does stall, badly. Whether or not the solution requires slowing writes in addition to stopping them, we need a solution that doesn't stop writes for 5 minutes during overloads.

ajkr commented

The average throughput from the red curve was slightly higher than the blue curve. So I am confused by this.

Could the RocksDB that produced the red curve handle an arrival rate resembling the blue curve with decent tail latencies?

Sorry, let me try to clarify. I believe a "workload" is essentially an arrival rate curve. In a closed-loop benchmark, I believe a curve in these graphs represents both the system's processing rate and its arrival rate (since a request is generated as soon as the previous request completes). Since the curves for "x4.io.old" and "x4.io.new" are not identical, there are two different workloads under consideration here.

The blue curve represents a spiky workload, and the red curve represents a smooth workload. Consider if we send the spiky workload to both systems. We know "x4.io.old" can handle it without tail latencies impacted by sleeps since it processed that exact workload before. But we do not know that for "x4.io.new". At points where the blue curve exceeds the red curve like the long spike at the beginning, it seems unlikely "x4.io.new" can process that fast. And if it falls behind, requests pile up, effectively causing requests to experience longer and longer delays even though writes aren't stopped.

The average throughput increasing due to adding sleeps is interesting. But the question I have about those cases is, is it a fundamental improvement, or is it preventing the LSM from entering a bloated state that we handle inefficiently? I have thought about these cases previously and predict/hope it is the latter, mostly because it feels like with more compaction debt we have more choices of what to compact, lower fanout, etc., that should allow us to process writes more efficiently.

But RocksDB does stall, badly. Whether or not the solution requires slowing writes in addition to stopping them, we need a solution that doesn't stop writes for 5 minutes during overloads.

Agreed. My point has only been that slowdowns can cause pileups of requests that experience latency similar to or worse than under the stop condition (in open-loop systems only), so we need to carefully consider changes that might cause more slowdowns than before, or make it clear when they're useful / how to opt out. Right now there are some that can't be disabled, like slowing down when N-1 memtables are full (and N > 3 I think).

I agree that we should have workloads with 1) open-loop and 2) varying arrival rates and I am happy to help on the effort to create them.

I suspect you are being generous in stating that RocksDB can handle the blue curve. A write request that arrives during any of those valleys (stall periods) will wait several minutes. So the workload it can accept has to be the same or a subset of the blue curve. Maybe we are assigning a different value to avoiding multi-minute stalls.

RocksDB might be doing the right thing when increasing the compaction debt it allows, but it isn't doing the right thing when reducing it. WRT average throughput increasing by adding sleeps, the issue is that RocksDB can't make up its mind about what the per-level target sizes should be. It increases the targets slowly as the L0 grows, but then suddenly decreases them when the L0 shrinks after a large L0->L1 compaction. Persistent structures can't change their shape as fast as that code asks them to change.

Filed #9478 for a benchmark client that does open-loop and varying arrival rates.

Here is one scenario for a benchmark client that sustains a fixed arrival rate when it encounters a multi-minute write stall:

  1. Assume the client needs 10 threads to sustain the target rate
  2. And the client checks every 100 milliseconds to determine whether new threads are needed because existing ones are stalled

Then a stall arrives, and 10 threads are created every 100 milliseconds, or 100 threads/second and 6000 threads/minute. So after a 3-minute stall there will be ~20,000 threads, assuming my server has enough memory to spawn them. It will be interesting to see what happens when that dogpile is slowly un-piled.
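
To make the scenario concrete, here is a hypothetical sketch of such a client (not db_bench and not the client from #9478; the target rate, tick interval and stall detection are made up):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<uint64_t> completed_ops{0};
std::atomic<bool> stop{false};

void WriterThread() {
  while (!stop.load()) {
    // A blocking DB write would go here; during a write stall it blocks for
    // the whole stall, so this thread makes no progress.
    completed_ops.fetch_add(1);
  }
}

int main() {
  const uint64_t target_ops_per_tick = 10000;  // hypothetical target rate
  std::vector<std::thread> writers;
  for (int i = 0; i < 10; ++i) writers.emplace_back(WriterThread);

  uint64_t last = completed_ops.load();
  for (int tick = 0; tick < 1800; ++tick) {  // 3 minutes of 100 ms ticks
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    uint64_t now = completed_ops.load();
    if (now - last < target_ops_per_tick) {
      // Existing threads look stalled: add 10 more, i.e. 100/second and
      // 6000/minute, so a 3-minute stall leaves ~18,000 extra threads.
      for (int i = 0; i < 10; ++i) writers.emplace_back(WriterThread);
    }
    last = now;
  }
  stop.store(true);
  for (auto& t : writers) t.join();
  return 0;
}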

In sysbench we implemented a "queue" for events arriving at the target rate. We do not create extra threads, but just place arrivals into the "queue"... when the stall happens, the queue will grow, and then we need to decide at what moment the queue becomes too long to return the response within the acceptable response-time window.

In sysbench we implemented a "queue" for events arriving at the target rate. We do not create extra threads, but just place arrivals into the "queue"... when the stall happens, the queue will grow, and then we need to decide at what moment the queue becomes too long to return the response within the acceptable response-time window.

That is interesting, but it keeps the server from seeing too many concurrent query requests, which used to be an exciting event back in the day.
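
A minimal sketch of that queue-based approach (purely illustrative, not the sysbench implementation; the 1 ms inter-arrival time and 100 ms response-time window are made up):

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;

int main() {
  std::mutex mu;
  std::condition_variable cv;
  std::deque<Clock::time_point> queue;  // arrival timestamps
  bool done = false;

  // Producer: one event per millisecond, i.e. a fixed 1000/s arrival rate
  // that does not depend on how fast the database responds.
  std::thread producer([&] {
    for (int i = 0; i < 60000; ++i) {
      { std::lock_guard<std::mutex> lock(mu); queue.push_back(Clock::now()); }
      cv.notify_one();
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    { std::lock_guard<std::mutex> lock(mu); done = true; }
    cv.notify_all();
  });

  // Consumer: stands in for the thread issuing writes. If the database
  // stalls, the queue grows and the queueing delay is counted against the
  // response-time window instead of spawning more threads.
  std::thread consumer([&] {
    while (true) {
      std::unique_lock<std::mutex> lock(mu);
      cv.wait(lock, [&] { return done || !queue.empty(); });
      if (queue.empty()) break;
      Clock::time_point arrival = queue.front();
      queue.pop_front();
      lock.unlock();
      long long waited_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
          Clock::now() - arrival).count();
      if (waited_ms > 100) {  // acceptable response-time window exceeded
        std::printf("request queued for %lld ms\n", waited_ms);
      }
      // ... the actual write would be issued here ...
    }
  });

  producer.join();
  consumer.join();
  return 0;
}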

WRT the start of the bug, perhaps I can restate my wishes, given that per-level target sizes can be dynamically increased when the L0 gets too large because the write rate is high:

  1. I prefer an option to opt out of that behavior. I want a stronger limit on space-amp, and the current behavior means that space-amp can reach 2X (or even exceed it) when I expect it to be closer to 1.1X. There is no way to disable it today.

  2. Decreasing the per-level target sizes should be gradual. It is sudden today and that is the source of long write stalls (see the sketch below).
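
A sketch of what "gradual" could look like (purely illustrative; the function name and the 10% cap per recomputation are made up, and this is not a proposed RocksDB API):

#include <algorithm>
#include <cstdint>

// Purely illustrative: cap how fast a recomputed per-level target may drop
// relative to its previous value, so estimated pending compaction bytes
// cannot jump by tens of GB in a single recomputation.
uint64_t GraduallyDecreasedTarget(uint64_t prev_target, uint64_t new_target) {
  if (new_target >= prev_target) {
    return new_target;  // increases can still take effect immediately
  }
  const uint64_t max_decrease = prev_target / 10;  // at most 10% per recompute
  return std::max(new_target, prev_target - max_decrease);
}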

I repeated a benchmark using RocksDB version 6.28.2, both as-is and with a hack. The hack does two things: it disables intra-L0 compaction and disables the dynamic adjustment of the per-level target sizes. The test is IO-bound (database larger than RAM) and was repeated on two server types -- one with 18 CPU cores, one with 40 CPU cores. I used 16 client threads for the 18-core server and 32 client threads for the 40-core server.

While I did the full set of benchmarks with tools/benchmark.sh, I only share the results for readwhilewriting and overwrite. I modified the scripts to use --benchmarks=$whatever,waitforcompaction to make sure that the work to reduce compaction debt would be assigned to each benchmark.

The summary is that for overwrite:

  • throughput is similar with/without the hack
  • worst-case response time is much better with the hack (1.5 vs 439.467 seconds, 0.162 vs 252.337 seconds)
  • write-amp is larger with the hack (19.4 vs 16.6, 17.5 vs 13.6)
  • compaction wall clock seconds are larger with the hack (51090 vs 31901, 41856 vs 26979)
Legend:
* qps - operations/sec. Measures Get/s for readwhilewriting and Put/s for overwrite
* wMB/s - average write MB/s per iostat
* cpu/s - average CPU/s per vmstat (us + sy columns)
* p99.99 - p99.99 response time in seconds for Get with readwhilewriting and Put for overwrite
* pmax - max response time in seconds for Get with readwhilewriting and Put for overwrite
* cpu/q - cpu/s / qps
* w-amp - write amplification = (flush + compaction) / ingest
* comp.w - compaction wall clock seconds from Comp(sec) column in compaction IO stats
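
For example, in the first 18-core readwhilewriting row below, cpu/q for as-is is cpu/s / qps = 25.6 / 107365 ≈ .000238.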

--- res.iobuf, 18-core

readwhilewriting with 10MB/s writer
        qps     wMB/s   cpu/s   p99.99  pmax    cpu/q
as-is   107365  174.1   25.6    0.0028  0.116   .000238
hack    107367  174.4   24.8    0.0028  0.109   .000230

overwrite with unlimited writers
        qps     wMB/s   cpu/s   p99.99  pmax    cpu/q   w-amp   comp.w
as-is   107030  575.0   23.6    0.0188  252.337 .000220 13.6    26979
hack    100263  698.4   30.1    0.0165    0.162 .000300 17.5    41856

--- res.iobuf, 40-core

readwhilewriting with 10MB/s writer
        qps     wMB/s   cpu/s   p99.99  pmax    cpu/q
as-is   213952  175.8   19.8    0.0012  0.022   .000092
hack    208109  189.8   19.8    0.0012  0.043   .000095

overwrite with unlimited writers
        qps     wMB/s   cpu/s   p99.99  pmax    cpu/q   w-amp   comp.w
as-is   110440  696.2   17.5    0.126   439.467 .000158 16.6    31901
hack    115744  894.4   21.7    0.484     1.515 .000187 19.4    51090

These are compaction IO stats from the end of db_bench --benchmarks=overwrite,waitforcompaction,stats

--- 18-core, overwrite

- as-is
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0    488.8     0.0    488.8     803.4    314.6       0.0   2.6    177.9    292.3   2814.45           2596.41     26449    0.106   1220M   212K       0.0       0.0
  L3      3/0   48.27 MB   0.8   1195.2   314.6    880.6    1190.5    309.8       0.0   3.8    279.7    278.6   4375.27           4169.31       288   15.192   2989M    10M       0.0       0.0
  L4     25/0   334.28 MB   1.0    414.0   234.8    179.2     411.5    232.3      75.0   1.8    247.9    246.4   1710.17           1460.89     10353    0.165   1035M  6179K       0.0       0.0
  L5    226/0    2.69 GB   1.0    711.9   307.3    404.6     567.2    162.6       0.0   1.8    109.7     87.4   6644.46           6414.37     18788    0.354   2595M    35M       0.0       0.0
  L6   1648/0   21.62 GB   1.0    529.6   161.7    367.9     500.8    132.9       0.9   3.1     93.3     88.2   5814.81           5652.66     10762    0.540   2390M   130M       0.0       0.0
  L7  11977/0   173.03 GB   0.0    587.1   133.9    453.2     452.3     -1.0       0.0   3.4    107.0     82.4   5619.90           5467.70      8783    0.640   2691M   601M       0.0       0.0
 Sum  13879/0   197.71 GB   0.0   3926.7  1152.3   2774.4    3925.6   1151.2      75.9  12.5    149.0    149.0  26979.06          25761.35     75423    0.358     12G   783M       0.0       0.0

- hack
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   43.37 MB   0.8      0.0     0.0      0.0     289.2    289.2       0.0   1.0      0.0    274.9   1077.29            895.43     20369    0.053       0      0       0.0       0.0
  L3      4/0   57.61 MB   0.9    528.5   289.1    239.4     528.1    288.8       0.0   1.8    261.4    261.2   2070.61           1908.04      3130    0.662   1319M   270K       0.0       0.0
  L4     23/0   344.10 MB   1.0   1030.5   287.2    743.4    1030.0    286.7       1.6   3.6    256.9    256.8   4107.93           3602.62     19039    0.216   2575M   910K       0.0       0.0
  L5    213/0    2.69 GB   1.0   1612.5   288.3   1324.2    1480.7    156.5       0.0   5.1     95.5     87.7  17286.75          16703.24     19599    0.882   6696M    14M       0.0       0.0
  L6   1620/0   21.63 GB   1.0    874.5   156.4    718.1     850.7    132.6       0.1   5.4     87.5     85.1  10239.91           9864.87     11159    0.918   3947M   107M       0.0       0.0
  L7  11788/0   173.04 GB   0.0    675.7   132.7    543.0     542.6     -0.4       0.0   4.1     97.8     78.5   7073.34           6736.39      9146    0.773   3101M   594M       0.0       0.0
 Sum  13651/0   197.79 GB   0.0   4721.7  1153.6   3568.1    4721.3   1153.2       1.7  16.3    115.5    115.5  41855.82          39710.60     82442    0.508     17G   717M       0.0       0.0

--- 40-core, overwrite

- as-is
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0    501.0     0.0    501.0     819.4    318.4       0.0   2.6    171.9    281.1   2985.09           2573.86     27224    0.110   1250M    45K       0.0       0.0
  L2      3/0   48.28 MB   0.8   1365.2   318.4   1046.8    1363.9    317.1       0.0   4.3    318.3    318.0   4391.70           4287.22       310   14.167   3412M  2295K       0.0       0.0
  L3     13/0   209.03 MB   1.0    418.8   232.3    186.5     418.2    231.7      84.8   1.8    285.4    285.0   1502.61           1387.38     11087    0.136   1047M  1335K       0.0       0.0
  L4    165/0    1.69 GB   1.0    745.9   316.5    429.4     603.0    173.6       0.0   1.9    128.3    103.8   5950.97           5837.25     18605    0.320   2728M  8153K       0.0       0.0
  L5   1051/0   13.56 GB   1.0    363.6   152.1    211.5     360.4    149.0      21.5   2.4    107.5    106.6   3461.65           3395.57      9220    0.375   1640M    13M       0.0       0.0
  L6   8127/0   108.52 GB   1.0    663.2   170.1    493.2     642.5    149.4       0.2   3.8    107.9    104.5   6293.90           6148.99     11212    0.561   2994M    93M       0.0       0.0
  L7  57094/0   868.18 GB   0.0    838.4   147.6    690.8     704.6     13.7       0.2   4.8    117.4     98.6   7315.33           7098.31      9852    0.743   3825M   572M       0.0       0.0
 Sum  66453/0   992.20 GB   0.0   4896.1  1336.9   3559.2    4912.1   1352.9     106.7  15.4    157.2    157.7  31901.25          30728.57     87510    0.365     16G   691M       0.0       0.0

- hack

Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0     334.1    334.1       0.0   1.0      0.0    155.5   2200.43           1074.00     22701    0.097       0      0       0.0       0.0
  L2      4/0   58.03 MB   0.9    568.9   334.1    234.8     568.5    333.7       0.0   1.7    288.4    288.1   2020.34           1929.23      3393    0.595   1419M    59K       0.0       0.0
  L3     14/0   209.12 MB   1.0    782.6   298.5    484.1     782.5    298.4      35.2   2.6    295.4    295.4   2712.68           2580.82     19220    0.141   1954M    88K       0.0       0.0
  L4    130/0    1.69 GB   1.0   1791.7   333.6   1458.1    1642.7    184.6       0.0   4.9    113.4    103.9  16184.93          15876.30     21644    0.748   7401M  1670K       0.0       0.0
  L5   1032/0   13.52 GB   1.0   1019.0   184.5    834.5    1015.8    181.4       0.1   5.5    103.1    102.8  10117.30           9907.30     13113    0.772   4595M    12M       0.0       0.0
  L6   8119/0   108.27 GB   1.0   1012.5   181.0    831.4     990.4    159.0       0.0   5.5    102.7    100.4  10099.69           9909.91     12707    0.795   4572M    99M       0.0       0.0
  L7  66668/0   866.31 GB   0.0    861.3   155.6    705.8     728.4     22.6       0.6   4.7    113.7     96.2   7754.44           7594.22     10891    0.712   3925M   563M       0.0       0.0
 Sum  75967/0   990.05 GB   0.0   6036.0  1487.3   4548.7    6062.4   1513.7      35.8  18.1    121.0    121.5  51089.82          48871.78    103669    0.493     23G   677M       0.0       0.0

So from the results above, can we get the efficiency benefits of the current code without the worst-case write stall?

Contributing to @mdcallag's findings here, I can confirm that we are also observing with many versions of RocksDB (including 7.2) that for some types of constant input loads RocksDB can suddenly go into minute-long write stalls. We have seen GetDelayTime() calculating write stall times of more than 20 minutes.

IMO such erratic behavior is bad from the user's perspective: requests to RocksDB work fine most of the time, but suddenly the processing time can go through the roof because of unpredictable write stalls. The tail latencies are then way too high. From the user perspective, I would also favor a more even throughput rate, even if it is not the maximum: it is much better to get a constant ingestion rate than to have ingestion requests fail suddenly because of too-high tail latencies.

We have put our own write-throttling mechanism in front of RocksDB to avoid the worst stalls, but ideally RocksDB should gracefully handle inputs and try to achieve a smooth input rate using the configured slowdown trigger values.
It would be very helpful if this issue could be followed up on. Thanks!
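
For illustration only, here is a minimal sketch of an application-side write throttle of the kind described above. This is not the actual mechanism; the class name, byte budget and 10 ms polling interval are made up.

#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>

// Minimal application-side throttle: callers acquire a byte budget before
// each write so the arrival rate seen by RocksDB stays below a fixed cap.
class WriteThrottle {
 public:
  explicit WriteThrottle(uint64_t bytes_per_sec) : bytes_per_sec_(bytes_per_sec) {}

  // Blocks until `bytes` fits into the current one-second budget.
  void Acquire(uint64_t bytes) {
    std::unique_lock<std::mutex> lock(mu_);
    while (true) {
      auto now = std::chrono::steady_clock::now();
      if (now - window_start_ >= std::chrono::seconds(1)) {
        window_start_ = now;  // start a new one-second window
        used_ = 0;
      }
      if (used_ + bytes <= bytes_per_sec_) {
        used_ += bytes;
        return;
      }
      lock.unlock();
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      lock.lock();
    }
  }

 private:
  const uint64_t bytes_per_sec_;
  std::mutex mu_;
  std::chrono::steady_clock::time_point window_start_ = std::chrono::steady_clock::now();
  uint64_t used_ = 0;
};

// Usage (names hypothetical): call throttle.Acquire(key.size() + value.size())
// before each db->Put(...) call.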

@jsteemann - currently under discussion at #10057

@mdcallag : thanks, this is very useful! And thanks for putting a lot of time into benchmarking the current (and adjusted) behaviors!

Fixed by:
First PR: #10057
Second PR: #10270

I hope to use the db_bench tool to verify the effect of this fix. What workloads should I use to verify it? This is my current test:

/db_bench --benchmarks=fillseq --db=/rocksdb-data --wal_dir=/rocksdb-data --key_size=64 --value_size=512 --num=40000000 --threads=16 --initial_auto_readahead_size=16384 --max_auto_readahead_size=65536 --block_size=32768 --level0_file_num_compaction_trigger=2 --compaction_readahead_size=524288 --level0_slowdown_writes_trigger=36 --level0_stop_writes_trigger=48 --write_buffer_size=134217728 --target_file_size_base=134217728 --max_bytes_for_level_base=536870912 --use_direct_reads=true --bloom_bits=10 --compression_type=none --batch_size=32