facebook/rocksdb

SSTs that overlap with compactions should be able to be ingested into deeper levels

v01dstar opened this issue · 2 comments

Expected behavior

When ingesting external SST files, an SST that overlaps with a registered compaction should NOT be considered overlap_with_db, but overlap_with_level. In other words, it should still be allowed to move down the levels, checking whether a deeper level can be selected for the file. If an SST does not overlap with any files in the previous levels but does overlap with a registered compaction, the two cannot share duplicate keys, so it is safe to ingest the SST into a different, deeper level whose key range does not overlap.
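The proposed selection can be sketched with a toy model (all names and data structures below are hypothetical and much simpler than RocksDB's actual ingestion code): a compaction overlap at some level only rules out that one level, while a real file overlap ends the descent.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of ingestion level selection (hypothetical; not RocksDB's API).
// Keys are strings; ranges are half-open [start, limit).
struct Range { std::string start, limit; };

bool Overlaps(const Range& a, const Range& b) {
  return a.start < b.limit && b.start < a.limit;
}

bool OverlapsAny(const Range& r, const std::vector<Range>& ranges) {
  for (const auto& x : ranges) {
    if (Overlaps(r, x)) return true;
  }
  return false;
}

// Proposed behavior: if `file` overlaps a registered compaction at level
// `lvl` but none of that level's files, skip `lvl` and keep descending
// ("overlap_with_level"); only a real file overlap ("overlap_with_db")
// stops the descent.
int PickIngestLevelProposed(const Range& file,
                            const std::vector<std::vector<Range>>& files,
                            const std::vector<std::vector<Range>>& compactions) {
  int target = 0;
  for (int lvl = 0; lvl < static_cast<int>(files.size()); ++lvl) {
    if (OverlapsAny(file, files[lvl])) break;           // duplicate keys possible
    if (OverlapsAny(file, compactions[lvl])) continue;  // blocked at this level only
    target = lvl;  // deepest clean level seen so far
  }
  return target;
}
```

With the scenario described below (file [10, 20), level-1 files [0, 10) and [20, 30), a level-1 compaction spanning both), this model would ingest the file into level 2 instead of level 0.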

I believe this was also the expected behavior before #10988; that PR changed it, although I suspect the change was not the PR's intent.

Actual behavior

SSTs that overlap with compactions are ingested into level 0, which triggers write stalls.

for (int lvl = 0; lvl < exclusive_end_level; lvl++) {
  if (lvl > 0 && lvl < vstorage->base_level()) {
    continue;
  }
  if (cfd_->RangeOverlapWithCompaction(file_to_ingest->start_ukey,
                                       file_to_ingest->limit_ukey, lvl)) {
    // We must use L0 or any level higher than `lvl` to be able to overwrite
    // the compaction output keys that we overlap with in this level. We also
    // need to assign this file a seqno to overwrite the compaction output
    // keys in level `lvl`.
    overlap_with_db = true;
    break;
  } else if (vstorage->NumLevelFiles(lvl) > 0) {

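For contrast with the proposed behavior, the effect of the loop above can be sketched with a toy model (hypothetical names and data structures, not RocksDB's actual API): a compaction overlap at any checked level is treated like a database overlap, so the descent stops and the file falls back to the last clean level, typically L0.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the current behavior (hypothetical; not RocksDB's API).
// Keys are strings; ranges are half-open [start, limit).
struct Range { std::string start, limit; };

bool Overlaps(const Range& a, const Range& b) {
  return a.start < b.limit && b.start < a.limit;
}

bool OverlapsAny(const Range& r, const std::vector<Range>& ranges) {
  for (const auto& x : ranges) {
    if (Overlaps(r, x)) return true;
  }
  return false;
}

// Current behavior: a compaction overlap is treated the same as a file
// overlap (overlap_with_db = true in the real code), so the descent stops
// immediately and the last clean level is used.
int PickIngestLevelCurrent(const Range& file,
                           const std::vector<std::vector<Range>>& files,
                           const std::vector<std::vector<Range>>& compactions) {
  int target = 0;
  for (int lvl = 0; lvl < static_cast<int>(files.size()); ++lvl) {
    if (OverlapsAny(file, compactions[lvl]) ||
        OverlapsAny(file, files[lvl])) {
      break;  // corresponds to overlap_with_db = true; break;
    }
    target = lvl;
  }
  return target;
}
```

In this model, a file whose keys are disjoint from every level's files but which touches a level-1 compaction's range ends up at level 0, matching the reported behavior.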
Steps to reproduce the behavior

We found this issue while backfilling data via ingestion. The application ensures that the data to be backfilled does not overlap with existing data, yet the files still got ingested into level 0. Presumably, some of the files overlap with an ongoing compaction at a higher level; e.g., an SST with range [10, 20) was considered overlap_with_db because a level-1 compaction was compacting [0, 10) and [20, 30).

A simple fix would be: main...v01dstar:rocksdb:fix-ingest-lvl

Would like to hear opinions from the maintainers.

If an SST does not overlap with any files in the previous levels but does overlap with a registered compaction, the two cannot share duplicate keys

One case where I think this may not hold is when files are moved from lower levels to higher levels. This can happen during a manual compaction whose target level is smaller than the output level.