facebook/rocksdb

Potential crash-consistency bug that violates "no hole in writes" assumption

Opened this issue · 0 comments

IKACE commented

Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

Hi RocksDB developers and maintainers, we found a potential crash consistency bug while testing on the RocksDB db_stress_tool using our crash consistency bug detecting tool under development. For your reference, we save the trace and operation files in this link.

During normal execution, Rocksdb would buffer the writes in memory and later persist them to disk. According to this article, it should guarantee there are no holes between the recovered write if there is a crash in the middle of persisting the writes. However, we discovered a case that violates the assumption.

We will explain the case with the following trace files.

  1. tracer.log: saves the syscall trace from db_stress_tool
  2. subgraph_x.png: demonstrates the potential persistence order of syscalls.
  3. 6_0.csv,7_0.csv,8_0.csv: results from the verification process from permutations of syscalls that trigger the crash consistency bug.
  4. opfile.log saves all the workload operations executed.

Our crash-consistency testing tool collects a syscall trace with optional operation begin/end markers provided from the workload.
And we modified the db_stress_tool to include the markers for the operations (e.g. write key-value, delete key-value...).
In the tracer.log, the problem occurs at the following snippet, which is simplified for readability.
The offset is also adjusted.

op_id, thread_id, workload_op_type, workload_thread_id, workload_op_id
27612,38,OP_BEGIN,0,8995
27613,38,WRITE,27, ....
27614,38,OP_END,0,8995
27615,38,OP_BEGIN,0,8996
27616,38,WRITE,27, ....
27617,38,OP_END,0,8996
27618,38,OP_BEGIN,0,8997
27619,38,WRITE,27, ...
27620,38,OP_END,0,8997
27621,38,OP_BEGIN,0,8998
27622,38,WRITE,27, ...
27623,38,OP_END,0,8998
27624,38,OP_BEGIN,0,8999
27625,38,WRITE,27, ...
27626,38,OP_END,0,8999

In a sequence of write syscalls to the disk, it may be the case that the writes can be reordered in terms of persistence order if they are on different blocks. In this case, ops 27616, 27619 and op 27622 are on different blocks and can be reordered.

We provide a figure of the persistence graph outputted from our crash-consistency testing tool in the folder. The persistence graph represents operations as nodes and persistence dependencies as edges. The persistence graph presents the order of each operation as follows

(row_1) op 27613 --> op 27616 --> op 27619
(row_2) op 27622 --> op 27625

Expected Behavior

In single-threaded workload, if any operation of row_1 was not persisted, the writes in row_2 would not be persisted.

Actual Behavior

For each of the csv files, it documents the operations and the orders applied by our tool, as well as the output result from db_stress_tool verification process.

(#1) 6_0.csv 27613 --> 27616 --> 27622 
exit_status: 1, err_msg: key from 27622 not found

(#2) 7_0.csv 27613 --> 27616 --> 27622 --> 27625 
exit_status: 1, err_msg: key from 27622 and 27625 not found

(#3) 8_0.csv 27613 --> 27622 
exit_status: 0

In both #1 and #2, operation 27619 was not persisted. Therefore, it is expected that writes from row_2 are missed or not found.

However, for case #3, although operations 27616 and 27619 are not persisted, the verification succeeds, indicating the key-value written by operation 27622 is found in the database. This violates the no hole assumption in RocksDB.

Steps to reproduce the behavior

  1. Git clone our version of RocksDB at this link. Checkout to rocksdb-bug-1 branch.
  2. Build db_stress_tool .
  3. Use a tool that could replay and reorder syscall trace and reproduce the crash states identified above in the csv files.
  4. Run the verification process with the following command:
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000 --threads=1 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=100000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --target_file_size_base=2097 --target_file_size_multiplier=2 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35 --db={{ database_path }} --opfile_path=opfile.log --squint_mode=checker  --destroy_db_initially=0 --clear_column_family_one_in=0 --threads=1 --expected_values_dir={{ database_path }} --ops_completed_path={{ database_path }}/ops_completed

Linux Distribution

Ubuntu 22.04.4 LTS

File Systems

ext4