There are too many write stalls because the write slowdown stage is frequently skipped
Closed this issue · 17 comments
Expected behavior
Write slowdowns should be applied long before write stalls
Actual behavior
The bytes pending compaction value computed in EstimateCompactionBytesNeeded changes too rapidly. With write-heavy workloads (db_bench --benchmarks=overwrite) RocksDB can change within 1 or 2 seconds from no stalls (or some slowdown) to long periods of write stalls. Even worse are write stalls that last for 100+ seconds. With a small change (see PoC diff below) there are no long stalls and variance is greatly reduced.
The root cause is that CalculateBaseBytes computes sizes bottom-up for all levels except Lmax. It uses max(sizeof(L0), options.max_bytes_for_level_base) for the L1 target size, and then from L1 to Lmax-1, each level's target size is set to be fanout times larger than the previous level's.
This leads to large targets when L0 is large because compaction gets behind. On its own that isn't horrible and might even be considered adaptive. However, when sizeof(L0) suddenly shrinks because an L0->L1 compaction finishes, the targets can suddenly become much smaller because the L1 target will be max_bytes_for_level_base rather than sizeof(L0). At that point the bytes pending compaction value can become huge because it is the sum of the per-level bytes pending compaction values (PLBPC), where PLBPC = (sizeof(level) - target_size(level)) * fanout.
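To make the arithmetic concrete, here is a minimal sketch of the estimate as described above. It is a simplified model, not the actual EstimateCompactionBytesNeeded implementation; the function name and parameters are assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Simplified model of the estimate described above, not the actual RocksDB
// implementation. Each level that exceeds its target contributes its excess
// multiplied by the fanout, because that excess must eventually be rewritten
// into the next (fanout-times-larger) level.
uint64_t EstimatePendingCompactionBytes(const std::vector<uint64_t>& level_size,
                                        const std::vector<uint64_t>& level_target,
                                        double fanout) {
  uint64_t total = 0;
  for (size_t i = 0; i < level_size.size(); ++i) {
    if (level_size[i] > level_target[i]) {
      // PLBPC = (sizeof(level) - target_size(level)) * fanout
      total += static_cast<uint64_t>((level_size[i] - level_target[i]) * fanout);
    }
  }
  return total;
}
```

Under this model, if the targets shrink suddenly while the actual level sizes barely change, the sum jumps by tens of billions of bytes in one recomputation, which matches the ~25B to ~85B jump in the LOG excerpts below.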
We also need to improve visibility for this. AFAIK, it is hard or impossible to see the global bytes pending compaction, the contribution from each level and the target size for each level. I ended up adding a lot of code to dump more info to LOG.
Here is one example using things written to LOG, including extra monitoring I added. The first block of logging is at 20:23:49 and shows write slowdowns because bytes pending compaction ~= 25B while the slowdown trigger was ~= 23B. The second line lists the target size per level: 1956545174 (~2B) for L1, 5354149941 (~5.3B) for L2, 14651806650 (~14.6B) for L3, 40095148713 (~40B) for L4, 109721687484 (~110B) for L5. In this case the value for bytes pending compaction jumps from ~25B to ~85B in 2 seconds. In other tests I have seen it increase by 100B or 200B in a few seconds because L0->L1 completes.
2022/01/21-20:23:49.546718 7ff4d71ff700 [/version_set.cc:2687] ZZZ ECBN scores 24985540821 total, 25.0 totalB, 2.0 L0, 15.6 L1, 9.3 L2, 0.0 L3, 0.0 L4, 0.0 L5, 0.0 L6, 0.0 L7, 0.0 L8, 0.0 L9
2022/01/21-20:23:49.546733 7ff4d71ff700 [/version_set.cc:2703] ZZZ ECBN target 12 L0, 1956545174 L1, 5354149941 L2, 14651806650 L3, 40095148713 L4, 109721687484 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:49.546737 7ff4d71ff700 [/version_set.cc:2716] ZZZ ECBN actual 1956545174 L0, 6988421099 L1, 10062248176 L2, 14614385201 L3, 4683299263 L4, 37515691925 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:49.546773 7ff4d71ff700 [WARN] [/column_family.cc:984] [default] ZZZ Stalling writes because of estimated pending compaction bytes 24985540821 rate 5368703
But 2 seconds later, after L0->L1 finishes, the per-level target sizes are recomputed and are much smaller: 141300841 (~140M) for L1, 654055230 (~650M) for L2, 3027499631 (~3B) for L3, 14013730948 (~14B) for L4, 64866946001 (~65B) for L5. The result is that the per-level contributions to bytes pending compaction are much larger for L2, L3 and L4, which triggers a long write stall. The ECBN scores line has the value of estimated_compaction_bytes_needed_ as total and totalB, followed by the per-level contributions. The ECBN actual line has the actual per-level sizes. These show that the increase comes from the sudden reduction in the per-level target sizes.
2022/01/21-20:23:51.893239 7ff4db3ff700 [/version_set.cc:3859] ZZZ set level_max_bytes: 141300841 L1, 654055230 L2, 3027499631 L3, 14013730948 L4, 64866946001 L5, 300256990742 L6, 9 L7, 9 L8, 9 L9
2022/01/21-20:23:51.906933 7ff4db3ff700 [/version_set.cc:2687] ZZZ ECBN scores 84473373628 total, 84.5 totalB, 0.1 L0, 18.7 L1, 20.5 L2, 22.1 L3, 23.1 L4, 0.0 L5, 0.0 L6, 0.0 L7, 0.0 L8, 0.0 L9
2022/01/21-20:23:51.906951 7ff4db3ff700 [/version_set.cc:2703] ZZZ ECBN target 3 L0, 141300841 L1, 654055230 L2, 3027499631 L3, 14013730948 L4, 64866946001 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:51.906954 7ff4db3ff700 [/version_set.cc:2716] ZZZ ECBN actual 141300841 L0, 6955042332 L1, 11844113742 L2, 21096345478 L3, 22752145110 L4, 46254106087 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:51.907149 7ff4db3ff700 [WARN] [/column_family.cc:925] [default] ZZZ Stopping writes because of estimated pending compaction bytes 84473373628
Steps to reproduce the behavior
A proof-of-concept diff to fix this is here. It computes the level targets from largest to smallest, while the current code does it from smallest to largest.
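For illustration, here is a minimal sketch of the two orderings. It is not the PoC diff or the real CalculateBaseBytes; the function names and the use of the Lmax size as the top-down anchor are assumptions meant to show why the bottom-up targets swing with every L0->L1 compaction while the top-down targets do not.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified sketch of the two orderings. Levels are indexed L0..L(num_levels-1);
// a target of 0 means "no target computed for that level in this model".

// Current behavior (bottom-up): the L1 target tracks max(sizeof(L0), base), so
// the whole ladder inflates while L0 is large and collapses as soon as an
// L0->L1 compaction shrinks L0.
std::vector<uint64_t> TargetsBottomUp(uint64_t l0_size, uint64_t base,
                                      double fanout, int num_levels) {
  std::vector<uint64_t> target(num_levels, 0);
  target[1] = std::max(l0_size, base);
  for (int i = 2; i + 1 < num_levels; ++i) {
    target[i] = static_cast<uint64_t>(target[i - 1] * fanout);
  }
  return target;
}

// PoC direction (top-down): anchor on the size of the largest level and divide
// by fanout, so the targets do not change with the size of L0.
std::vector<uint64_t> TargetsTopDown(uint64_t lmax_size, double fanout,
                                     int num_levels) {
  std::vector<uint64_t> target(num_levels, 0);
  target[num_levels - 1] = lmax_size;
  for (int i = num_levels - 2; i >= 1; --i) {
    target[i] = static_cast<uint64_t>(target[i + 1] / fanout);
  }
  return target;
}
```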
Reproduction:
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=5 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=24696061952 --hard_pending_compaction_bytes_limit=49392123904 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1642906118
numactl --interleave=all ./db_bench --benchmarks=overwrite --use_existing_db=1 --sync=0 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=5 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=24696061952 --hard_pending_compaction_bytes_limit=49392123904 --duration=3600 --threads=16 --merge_operator="put" --seed=1642907605 --report_file=bm.lc.nt16.cm1.d0/v6.26.1/benchmark_overwrite.t16.s0.log.r.csv
Results are for 2 workloads (CPU-bound, IO-bound), each repeated with 3 configs, comparing v6.26.1 without and with the PoC fix listed above. The CPU-bound workload uses the command line above except the number of keys is 40M. The IO-bound workload uses the command line above (800M keys). Both were run on a large server with fast SSD, 64G of RAM and 36 HW threads.
all use:
--level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30
x1 uses:
--soft_pending_compaction_bytes_limit=24696061952 --hard_pending_compaction_bytes_limit=49392123904
x2 uses:
--soft_pending_compaction_bytes_limit=49392123904 --hard_pending_compaction_bytes_limit=98784247808
x3 uses:
--soft_pending_compaction_bytes_limit=98784247808 --hard_pending_compaction_bytes_limit=197568495616
Below, old is upstream without changes and new has the PoC diff. The diff fixes the stalls with a small benefit to average throughput. The worst case stall without the fix is between 218 and 319 seconds for the IO-bound workloads.
--- cpu-bound
qps (old)  qps (new)  max stall usecs (old)  max stall usecs (new)
183449     261900     5651332                22779
170995     261053     161738                 26681
176169     260813     133591                 98030
--- IO-bound
qps (old)  qps (new)  max stall usecs (old)  max stall usecs (new)
104364     110613     239833382              98609
109937     113630     318710946              124163
109735     114530     218813592              99256
The graphs show the per-interval write rate using 5-second intervals. The red line is RocksDB v6.26.1 with the PoC fix; the blue line is RocksDB v6.26.1 as-is.
For the x1 config and IO-bound workload
For the x1 config and CPU-bound workload
For the x2 config and IO-bound workload
For the x2 config and CPU-bound workload
Siying, Jay, and I discussed this a bit. I am probably the one slowing down the response to this issue due to being caught up in the philosophy of slowing down writes. Giving a quick update on where I am on that right now.
Take, for example, an arrival rate curve that would come close to violating limits on space, memory, or read-amp, but not quite reach those limits. The gradual slowdown phase is harmful to such workloads because it causes them to sleep while not protecting any limit. Worse, when arrival rate exceeds processing rate in the slowdown phase (as is likely to happen, otherwise we wouldn't be slowing down), the requests queue up and each one experiences progressively higher latencies, not just the latency of one sleep.
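A toy model of that pileup (the arrival and processing rates below are assumed numbers, not measurements) shows how the queueing delay grows with time once the slowdown caps the processing rate below the arrival rate:

```cpp
#include <cstdio>

// Toy open-loop model: requests arrive at a fixed rate while the slowdown
// phase caps the processing rate below it. Each arriving request waits behind
// the backlog, so its latency grows with time rather than being one sleep.
int main() {
  const double arrival_per_sec = 100000.0;    // assumed arrival rate
  const double processing_per_sec = 60000.0;  // assumed throttled rate
  double backlog = 0.0;
  for (int sec = 1; sec <= 10; ++sec) {
    backlog += arrival_per_sec - processing_per_sec;
    double queue_delay_sec = backlog / processing_per_sec;
    std::printf("t=%2ds backlog=%8.0f latency=%.2fs\n", sec, backlog,
                queue_delay_sec);
  }
  return 0;
}
```

With these assumed rates, after 10 seconds the backlog is 400K requests and a newly arriving request waits about 6.7 seconds, far more than one sleep's worth of delay.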
The graphs look much more stable with the proposed change. But does it mean RocksDB is more stable? I am not sure because each curve in the graphs shows a different arrival rate. I think it is important to fix the arrival rate (note: db_bench cannot do this) and look at the tail latencies. For example take this graph:
Could the RocksDB that produced the red curve handle an arrival rate resembling the blue curve with decent tail latencies? Or vice versa? I am not sure. All I know is the blue RocksDB could handle the blue curve smoothly, and the red one could handle the red curve smoothly.
One thing Siying mentioned today that stuck with me is maybe there are some systems that use RocksDB like a closed-loop benchmark (I suppose they have some control over arrival rate, e.g., by shedding excess requests to somewhere else). For those it indeed would be nice if we held a stable processing rate in the face of pressure on the system. I need to think about this case further, and how it'd be done so as not to regress systems that have no control over arrival rate.
Some brief notes on past experiences:
- In Pebble we only had hard limits (discussion: cockroachdb/pebble#164 (comment)) in response to limit violations (space, memory, read-amp).
- Working with users, I have yet to hear user-visible problems that would be solvable by slowing down writes earlier. Domas wanted us to never stall.
WRT gradual slowdowns being harmful: users can set the soft and hard limits to be close or far apart, or we could make RocksDB more clever about when to apply the soft limit.
The average throughput from the red curve was slightly higher than the blue curve. So I am confused by this.
> Could the RocksDB that produced the red curve handle an arrival rate resembling the blue curve with decent tail latencies?
The red and blue curves will be largely unchanged for open-loop workloads, if measured from the perspective of RocksDB. These were plotted from the client's perspective (client == db_bench) because the client makes it easier to get that info.
One achievable goal for a DBMS is to provide steady throughput from the DBMS perspective during overload, whether from open-loop or closed-loop workloads. RocksDB can't do that today and I think it should.
> Working with users, I have yet to hear user-visible problems that would be solvable by slowing down writes earlier. Domas wanted us to never stall.
But RocksDB does stall, badly. Whether or not the solution requires slowing writes in addition to stopping them, we need a solution that doesn't stop writes for 5 minutes during overloads.
> The average throughput from the red curve was slightly higher than the blue curve. So I am confused by this.
> Could the RocksDB that produced the red curve handle an arrival rate resembling the blue curve with decent tail latencies?
Sorry let me try to clarify. I believe a "workload" is essentially an arrival rate curve. In a closed-loop benchmark, I believe a curve in these graphs represents both the system's processing rate and its arrival rate (since a request is generated as soon as the previous request completes). Since the curves for "x4.io.old" and "x4.io.new" are not identical, there are two different workloads under consideration here.
The blue curve represents a spiky workload, and the red curve represents a smooth workload. Consider if we send the spiky workload to both systems. We know "x4.io.old" can handle it without tail latencies impacted by sleeps since it processed that exact workload before. But we do not know that for "x4.io.new". At points where the blue curve exceeds the red curve like the long spike at the beginning, it seems unlikely "x4.io.new" can process that fast. And if it falls behind, requests pile up, effectively causing requests to experience longer and longer delays even though writes aren't stopped.
The average throughput increasing due to adding sleeps is interesting. But the question I have about those cases is, is it a fundamental improvement, or is it preventing the LSM from entering a bloated state that we inefficiently handle? I have thought about these cases previously and predict/hope it is the latter, mostly because it feels with more compaction debt we have more choices of what to compact, lower fanout, etc., that should allow us to process writes more efficiently.
> But RocksDB does stall, badly. Whether or not the solution requires slowing writes in addition to stopping them, we need a solution that doesn't stop writes for 5 minutes during overloads.
Agreed. My point has only been that slowdowns can cause pileups of requests that experience similar or worse latency to the stop condition (in open-loop systems only), so we need to carefully consider changes that might cause more slowdowns than before. Or make it clear when they're useful / how to opt out. Right now there are some that can't be disabled like slowing down when N-1 memtables are full (and N > 3 I think).
I agree that we should have workloads with 1) open-loop and 2) varying arrival rates and I am happy to help on the effort to create them.
I suspect you are being generous in stating that RocksDB can handle the blue curve. A write request that arrives during any of those valleys (stall periods) will wait several minutes. So the workload it can accept has to be the same or a subset of the blue curve. Maybe we are assigning a different value to avoiding multi-minute stalls.
RocksDB might be doing the right thing when increasing the compaction debt it allows, but it isn't doing the right thing when reducing it. WRT average throughput increasing by adding sleeps, the issue is that RocksDB can't make up its mind about what the per-level target sizes should be. It increases the targets slowly as the L0 grows, but then suddenly decreases them when the L0 shrinks after a large L0->L1 compaction. Persistent structures can't change their shape as fast as that code asks them to change.
Filed #9478 for the benchmark client that does open-loop and varying arrival rates
One scenario for a benchmark client that sustains a fixed arrival rate when it hits a multi-minute write stall:
- Assume the client needs 10 threads to sustain the target rate
- And the client checks every 100 milliseconds to determine whether new threads are needed because existing ones are stalled
Then a stall arrives, and 10 more threads are created every 100 milliseconds, or 100 threads/second and 6000 threads/minute. So after a 3-minute stall there will be ~20,000 threads, assuming my server has enough memory to spawn them. It will be interesting to see what happens when that dogpile is slowly un-piled.
In sysbench we implemented a "queue" for events arriving at the target rate. We do not create extra threads, but just place arrivals into the queue. When a stall happens, the queue grows, and then we need to decide at what point the queue has become too long to return responses within the acceptable response time window.
> In sysbench we implemented a "queue" for events arriving at the target rate. We do not create extra threads, but just place arrivals into the queue. When a stall happens, the queue grows, and then we need to decide at what point the queue has become too long to return responses within the acceptable response time window.
That is interesting, but it keeps the server from seeing too many concurrent query requests, which used to be an exciting event back in the day.
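A minimal sketch of that queue-length cutoff (the names and the use of a measured processing rate are assumptions, not sysbench's actual code):

```cpp
#include <cstddef>

// Sketch of the queue-based admission idea described above: events arriving at
// the target rate are enqueued instead of spawning threads, and an event is
// considered failed once the backlog implies it cannot be answered within the
// acceptable response-time window.
bool QueueTooLong(std::size_t queue_len, double processing_per_sec,
                  double acceptable_response_sec) {
  // Expected wait for a newly enqueued event is queue_len / processing rate.
  return queue_len > processing_per_sec * acceptable_response_sec;
}
```

For example, with an assumed processing rate of 60K ops/s and an acceptable response time of 100 ms, the queue becomes "too long" once it exceeds 6,000 entries.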
WRT the start of the bug, perhaps I can restate my wishes, given that per-level target sizes can be dynamically increased when the L0 gets too large because the write rate is high:
- I prefer an option to opt out of that behavior. I want a stronger limit on space-amp, and the current behavior means that space-amp can reach 2X (or even exceed it) when I expect it to be closer to 1.1X. There is no way to disable it today.
- Decreasing the per-level target sizes should be gradual. It is sudden today, and that is the source of the long write stalls (see the sketch after this list).
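As a strawman for the second wish, here is a sketch of a smoothed decrease. This is not RocksDB code; SmoothedTarget and the decay factor are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Strawman for gradual target-size decreases: when the freshly computed target
// is smaller than the previous one, move only part of the way down per
// recomputation instead of dropping immediately. Increases still apply at once.
// decay is an assumed tuning knob.
uint64_t SmoothedTarget(uint64_t prev_target, uint64_t new_target,
                        double decay = 0.9) {
  if (new_target >= prev_target) {
    return new_target;  // grow immediately, as today
  }
  // Shrink by at most (1 - decay) of the gap on each recomputation.
  return std::max(new_target,
                  static_cast<uint64_t>(prev_target * decay +
                                        new_target * (1.0 - decay)));
}
```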
I repeated a benchmark using RocksDB version 6.28.2 both as-is and with a hack. The hack does two things: it disables intra-L0 compaction and disables dynamic adjustment of the per-level target sizes. The test is IO-bound (database larger than RAM) and was repeated on two server types -- one with 18 CPU cores, one with 40 CPU cores. I used 16 client threads for the 18-core server and 32 client threads for the 40-core server.
While I did the full set of benchmarks with tools/benchmark.sh, I only share the results for readwhilewriting and overwrite. I modified the scripts to use --benchmarks=$whatever,waitforcompaction to make sure that the work to reduce compaction debt would be assigned to each benchmark.
The summary is that for overwrite:
- throughput is similar with/without the hack
- worst-case response time is much better with the hack (hack vs as-is: 1.5 vs 439.467 seconds on the 40-core server, 0.162 vs 252.337 seconds on the 18-core server)
- write-amp is larger with the hack (hack vs as-is: 19.4 vs 16.6 on the 40-core server, 17.5 vs 13.6 on the 18-core server)
- compaction wall clock seconds are larger with the hack (hack vs as-is: 51090 vs 31901 on the 40-core server, 41856 vs 26979 on the 18-core server)
Legend:
* qps - operations/sec. Measures Get/s for readwhilewriting and Put/s for overwrite
* wMB/s - average write MB/s per iostat
* cpu/s - average CPU/s per vmstat (us + sy columns)
* p99.99 - p99.99 response time in seconds for Get with readwhilewriting and Put for overwrite
* pmax - max response time in seconds for Get with readwhilewriting and Put for overwrite
* cpu/q - cpu/s / qps
* w-amp - write amplification = (flush + compaction) / ingest
* comp.w - compaction wall clock seconds from Comp(sec) column in compaction IO stats
--- res.iobuf, 18-core
readwhilewriting with 10MB/s writer
qps wMB/s cpu/s p99.99 pmax cpu/q
as-is 107365 174.1 25.6 0.0028 0.116 .000238
hack 107367 174.4 24.8 0.0028 0.109 .000230
overwrite with unlimited writers
qps wMB/s cpu/s p99.99 pmax cpu/q w-amp comp.w
as-is 107030 575.0 23.6 0.0188 252.337 .000220 13.6 26979
hack 100263 698.4 30.1 0.0165 0.162 .000300 17.5 41856
--- res.iobuf, 40-core
readwhilewriting with 10MB/s writer
qps wMB/s cpu/s p99.99 pmax cpu/q
as-is 213952 175.8 19.8 0.0012 0.022 .000092
hack 208109 189.8 19.8 0.0012 0.043 .000095
overwrite with unlimited writers
qps wMB/s cpu/s p99.99 pmax cpu/q w-amp comp.w
as-is 110440 696.2 17.5 0.126 439.467 .000158 16.6 31901
hack 115744 894.4 21.7 0.484 1.515 .000187 19.4 51090
These are compaction IO stats from the end of db_bench --benchmarks=overwrite,waitforcompaction,stats
--- 18-core, overwrite
- as-is
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 0/0 0.00 KB 0.0 488.8 0.0 488.8 803.4 314.6 0.0 2.6 177.9 292.3 2814.45 2596.41 26449 0.106 1220M 212K 0.0 0.0
L3 3/0 48.27 MB 0.8 1195.2 314.6 880.6 1190.5 309.8 0.0 3.8 279.7 278.6 4375.27 4169.31 288 15.192 2989M 10M 0.0 0.0
L4 25/0 334.28 MB 1.0 414.0 234.8 179.2 411.5 232.3 75.0 1.8 247.9 246.4 1710.17 1460.89 10353 0.165 1035M 6179K 0.0 0.0
L5 226/0 2.69 GB 1.0 711.9 307.3 404.6 567.2 162.6 0.0 1.8 109.7 87.4 6644.46 6414.37 18788 0.354 2595M 35M 0.0 0.0
L6 1648/0 21.62 GB 1.0 529.6 161.7 367.9 500.8 132.9 0.9 3.1 93.3 88.2 5814.81 5652.66 10762 0.540 2390M 130M 0.0 0.0
L7 11977/0 173.03 GB 0.0 587.1 133.9 453.2 452.3 -1.0 0.0 3.4 107.0 82.4 5619.90 5467.70 8783 0.640 2691M 601M 0.0 0.0
Sum 13879/0 197.71 GB 0.0 3926.7 1152.3 2774.4 3925.6 1151.2 75.9 12.5 149.0 149.0 26979.06 25761.35 75423 0.358 12G 783M 0.0 0.0
- hack
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 3/0 43.37 MB 0.8 0.0 0.0 0.0 289.2 289.2 0.0 1.0 0.0 274.9 1077.29 895.43 20369 0.053 0 0 0.0 0.0
L3 4/0 57.61 MB 0.9 528.5 289.1 239.4 528.1 288.8 0.0 1.8 261.4 261.2 2070.61 1908.04 3130 0.662 1319M 270K 0.0 0.0
L4 23/0 344.10 MB 1.0 1030.5 287.2 743.4 1030.0 286.7 1.6 3.6 256.9 256.8 4107.93 3602.62 19039 0.216 2575M 910K 0.0 0.0
L5 213/0 2.69 GB 1.0 1612.5 288.3 1324.2 1480.7 156.5 0.0 5.1 95.5 87.7 17286.75 16703.24 19599 0.882 6696M 14M 0.0 0.0
L6 1620/0 21.63 GB 1.0 874.5 156.4 718.1 850.7 132.6 0.1 5.4 87.5 85.1 10239.91 9864.87 11159 0.918 3947M 107M 0.0 0.0
L7 11788/0 173.04 GB 0.0 675.7 132.7 543.0 542.6 -0.4 0.0 4.1 97.8 78.5 7073.34 6736.39 9146 0.773 3101M 594M 0.0 0.0
Sum 13651/0 197.79 GB 0.0 4721.7 1153.6 3568.1 4721.3 1153.2 1.7 16.3 115.5 115.5 41855.82 39710.60 82442 0.508 17G 717M 0.0 0.0
--- 40-core, overwrite
- as-is
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 0/0 0.00 KB 0.0 501.0 0.0 501.0 819.4 318.4 0.0 2.6 171.9 281.1 2985.09 2573.86 27224 0.110 1250M 45K 0.0 0.0
L2 3/0 48.28 MB 0.8 1365.2 318.4 1046.8 1363.9 317.1 0.0 4.3 318.3 318.0 4391.70 4287.22 310 14.167 3412M 2295K 0.0 0.0
L3 13/0 209.03 MB 1.0 418.8 232.3 186.5 418.2 231.7 84.8 1.8 285.4 285.0 1502.61 1387.38 11087 0.136 1047M 1335K 0.0 0.0
L4 165/0 1.69 GB 1.0 745.9 316.5 429.4 603.0 173.6 0.0 1.9 128.3 103.8 5950.97 5837.25 18605 0.320 2728M 8153K 0.0 0.0
L5 1051/0 13.56 GB 1.0 363.6 152.1 211.5 360.4 149.0 21.5 2.4 107.5 106.6 3461.65 3395.57 9220 0.375 1640M 13M 0.0 0.0
L6 8127/0 108.52 GB 1.0 663.2 170.1 493.2 642.5 149.4 0.2 3.8 107.9 104.5 6293.90 6148.99 11212 0.561 2994M 93M 0.0 0.0
L7 57094/0 868.18 GB 0.0 838.4 147.6 690.8 704.6 13.7 0.2 4.8 117.4 98.6 7315.33 7098.31 9852 0.743 3825M 572M 0.0 0.0
Sum 66453/0 992.20 GB 0.0 4896.1 1336.9 3559.2 4912.1 1352.9 106.7 15.4 157.2 157.7 31901.25 30728.57 87510 0.365 16G 691M 0.0 0.0
- hack
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 334.1 334.1 0.0 1.0 0.0 155.5 2200.43 1074.00 22701 0.097 0 0 0.0 0.0
L2 4/0 58.03 MB 0.9 568.9 334.1 234.8 568.5 333.7 0.0 1.7 288.4 288.1 2020.34 1929.23 3393 0.595 1419M 59K 0.0 0.0
L3 14/0 209.12 MB 1.0 782.6 298.5 484.1 782.5 298.4 35.2 2.6 295.4 295.4 2712.68 2580.82 19220 0.141 1954M 88K 0.0 0.0
L4 130/0 1.69 GB 1.0 1791.7 333.6 1458.1 1642.7 184.6 0.0 4.9 113.4 103.9 16184.93 15876.30 21644 0.748 7401M 1670K 0.0 0.0
L5 1032/0 13.52 GB 1.0 1019.0 184.5 834.5 1015.8 181.4 0.1 5.5 103.1 102.8 10117.30 9907.30 13113 0.772 4595M 12M 0.0 0.0
L6 8119/0 108.27 GB 1.0 1012.5 181.0 831.4 990.4 159.0 0.0 5.5 102.7 100.4 10099.69 9909.91 12707 0.795 4572M 99M 0.0 0.0
L7 66668/0 866.31 GB 0.0 861.3 155.6 705.8 728.4 22.6 0.6 4.7 113.7 96.2 7754.44 7594.22 10891 0.712 3925M 563M 0.0 0.0
Sum 75967/0 990.05 GB 0.0 6036.0 1487.3 4548.7 6062.4 1513.7 35.8 18.1 121.0 121.5 51089.82 48871.78 103669 0.493 23G 677M 0.0 0.0
So from the results above, can we get the efficiency benefits of the current code without the worst-case write stall?
Contributing to @mdcallag's findings here, I can confirm that we are also observing with many versions of RocksDB (including 7.2) that for some types of constant input load RocksDB can suddenly go into minute-long write stalls.
We have seen GetDelayTime() calculating write stall times of more than 20 minutes.
IMO such erratic behavior is bad from the user's perspective: requests to RocksDB work fine most of the time, but suddenly the processing time can go through the roof because of unpredictable write stalls. The tail latencies are then way too high. From the user's perspective, I would also favor a more even throughput rate, even if it is not the maximum. It is much better to get a constant ingestion rate than to have ingestion requests suddenly fail because of excessive tail latencies.
We have put our own write-throttling mechanism in front of RocksDB to avoid the worst stalls, but ideally RocksDB should gracefully handle inputs and try to achieve a smooth input rate using the configured slowdown trigger values.
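For reference, here is a minimal sketch of the kind of application-side throttle described above, written as a token bucket. This is an assumption about the approach, not the actual mechanism used; the class name and parameters are made up for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Hypothetical application-side throttle placed in front of RocksDB: a token
// bucket that smooths the ingestion rate so the engine is not pushed into its
// stop-writes condition.
class WriteThrottle {
 public:
  explicit WriteThrottle(double bytes_per_sec)
      : rate_(bytes_per_sec), tokens_(bytes_per_sec),
        last_(std::chrono::steady_clock::now()) {}

  // Block until `bytes` worth of tokens are available, then consume them.
  void Acquire(double bytes) {
    while (true) {
      Refill();
      if (tokens_ >= bytes) {
        tokens_ -= bytes;
        return;
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

 private:
  void Refill() {
    auto now = std::chrono::steady_clock::now();
    double elapsed = std::chrono::duration<double>(now - last_).count();
    last_ = now;
    tokens_ = std::min(rate_, tokens_ + elapsed * rate_);  // cap burst at ~1s
  }

  double rate_;
  double tokens_;
  std::chrono::steady_clock::time_point last_;
};
```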
It would be very helpful if this issue could be followed up on. Thanks!
@jsteemann - currently under discussion at #10057
@mdcallag : thanks, this is very useful! And thanks for putting a lot of time into benchmarking the current (and adjusted) behaviors!
I hope to use the db_bench tool to verify the effect of this fix. What workloads should I use to verify it? This is my current test:
./db_bench --benchmarks=fillseq --db=/rocksdb-data --wal_dir=/rocksdb-data --key_size=64 --value_size=512 --num=40000000 --threads=16 --initial_auto_readahead_size=16384 --max_auto_readahead_size=65536 --block_size=32768 --level0_file_num_compaction_trigger=2 --compaction_readahead_size=524288 --level0_slowdown_writes_trigger=36 --level0_stop_writes_trigger=48 --write_buffer_size=134217728 --target_file_size_base=134217728 --max_bytes_for_level_base=536870912 --use_direct_reads=true --bloom_bits=10 --compression_type=none --batch_size=32