Potential crash-consistency bug that violates "no hole in writes" assumption
Opened this issue · 0 comments
Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev
Hi RocksDB developers and maintainers, we found a potential crash consistency bug while testing on the RocksDB db_stress_tool
using our crash consistency bug detecting tool under development. For your reference, we save the trace and operation files in this link.
During normal execution, Rocksdb would buffer the writes in memory and later persist them to disk. According to this article, it should guarantee there are no holes between the recovered write if there is a crash in the middle of persisting the writes. However, we discovered a case that violates the assumption.
We will explain the case with the following trace files.
tracer.log
: saves the syscall trace fromdb_stress_tool
subgraph_x.png
: demonstrates the potential persistence order of syscalls.6_0.csv
,7_0.csv
,8_0.csv
: results from the verification process from permutations of syscalls that trigger the crash consistency bug.opfile.log
saves all the workload operations executed.
Our crash-consistency testing tool collects a syscall trace with optional operation begin/end markers provided from the workload.
And we modified the db_stress_tool
to include the markers for the operations (e.g. write key-value, delete key-value...).
In the tracer.log
, the problem occurs at the following snippet, which is simplified for readability.
The offset is also adjusted.
op_id, thread_id, workload_op_type, workload_thread_id, workload_op_id
27612,38,OP_BEGIN,0,8995
27613,38,WRITE,27, ....
27614,38,OP_END,0,8995
27615,38,OP_BEGIN,0,8996
27616,38,WRITE,27, ....
27617,38,OP_END,0,8996
27618,38,OP_BEGIN,0,8997
27619,38,WRITE,27, ...
27620,38,OP_END,0,8997
27621,38,OP_BEGIN,0,8998
27622,38,WRITE,27, ...
27623,38,OP_END,0,8998
27624,38,OP_BEGIN,0,8999
27625,38,WRITE,27, ...
27626,38,OP_END,0,8999
In a sequence of write syscalls to the disk, it may be the case that the writes can be reordered in terms of persistence order if they are on different blocks. In this case, ops 27616, 27619 and op 27622 are on different blocks and can be reordered.
We provide a figure of the persistence graph outputted from our crash-consistency testing tool in the folder. The persistence graph represents operations as nodes and persistence dependencies as edges. The persistence graph presents the order of each operation as follows
(row_1) op 27613 --> op 27616 --> op 27619
(row_2) op 27622 --> op 27625
Expected Behavior
In single-threaded workload, if any operation of row_1
was not persisted, the writes in row_2
would not be persisted.
Actual Behavior
For each of the csv
files, it documents the operations and the orders applied by our tool, as well as the output result from db_stress_tool
verification process.
(#1) 6_0.csv 27613 --> 27616 --> 27622
exit_status: 1, err_msg: key from 27622 not found
(#2) 7_0.csv 27613 --> 27616 --> 27622 --> 27625
exit_status: 1, err_msg: key from 27622 and 27625 not found
(#3) 8_0.csv 27613 --> 27622
exit_status: 0
In both #1
and #2
, operation 27619
was not persisted. Therefore, it is expected that writes from row_2
are missed or not found.
However, for case #3
, although operations 27616
and 27619
are not persisted, the verification succeeds, indicating the key-value written by operation 27622
is found in the database. This violates the no hole assumption in RocksDB.
Steps to reproduce the behavior
- Git clone our version of RocksDB at this link. Checkout to
rocksdb-bug-1
branch. - Build
db_stress_tool
. - Use a tool that could replay and reorder syscall trace and reproduce the crash states identified above in the
csv
files. - Run the verification process with the following command:
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000 --threads=1 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=100000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --target_file_size_base=2097 --target_file_size_multiplier=2 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35 --db={{ database_path }} --opfile_path=opfile.log --squint_mode=checker --destroy_db_initially=0 --clear_column_family_one_in=0 --threads=1 --expected_values_dir={{ database_path }} --ops_completed_path={{ database_path }}/ops_completed
Linux Distribution
Ubuntu 22.04.4 LTS
File Systems
ext4