facebook/rocksdb

compaction not running or running very slowly when entire db is deleted?

zaidoon1 opened this issue · 6 comments

For my service, there are times when something happens that causes pretty much the entire db to be deleted using individual delete requests. I'm aware that deleting the entire db with individual deletes (instead of a range delete, for example) will fill the db with tombstones and severely degrade rocksdb's performance, as we can see below. What I don't understand is why it takes days for compaction to run and compact all of the tombstones away. As the screenshots show, there is not much activity and compaction is not running. I've seen this happen in the past, and it took a few days for rocksdb to recover/go back to normal after an event like this (deleting pretty much the entire db).

[Four screenshots of metrics dashboards from 2024-11-06 showing little activity and stalled compaction]

rocksdb config:

OPTIONS-000007.txt

I'm running the latest rocksdb version with pretty much default settings. However, I do have the db ttl set to a few hours, which makes this more confusing, since the docs say:

Leveled: Non-bottom-level files with all keys older than TTL will go through the compaction process. This usually happens in a cascading way so that those entries will be compacted to bottommost level/file. The feature is used to remove stale entries that have been deleted or updated from the file system.

My understanding is that after the entire db is deleted, once the ttl is up, compaction would run and compact the entire db to get rocksdb back to a normal state.
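For reference, the ttl is set roughly like this (a sketch using the C++ Options API; the 6-hour value and the path are illustrative, the actual settings are in the attached OPTIONS file):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Sketch: mostly default options, plus a db ttl of a few hours.
rocksdb::Options options;
options.create_if_missing = true;
// ttl is in seconds; for leveled compaction, non-bottom-level files whose
// keys are all older than this become eligible for ttl compaction.
options.ttl = 6 * 60 * 60;

rocksdb::DB* db = nullptr;
rocksdb::Status s = rocksdb::DB::Open(options, "/path/to/db", &db);
```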

cbi42 commented

there is not much activity and compaction is not running

After the entire DB is deleted, are there any more writes to the DB? RocksDB doesn't have a timer-based trigger that checks for eligible compactions; it usually tries to schedule a compaction after a flush or another compaction. I suspect there's nothing happening that would trigger a compaction to be scheduled. Can you do a manual compaction after you issue all the deletions?
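Something like this (a minimal sketch with the C++ API; adapt it to whatever binding your service uses):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Sketch: compact the whole key range once the bulk deletes are done.
// Assumes `db` is the already-open rocksdb::DB* in the read/write service.
rocksdb::CompactRangeOptions cro;
// Also rewrite the bottommost level so tombstones there can be dropped
// (they still can't be dropped if a snapshot needs them).
cro.bottommost_level_compaction =
    rocksdb::BottommostLevelCompaction::kForceOptimized;

// nullptr begin/end means the entire key range.
rocksdb::Status s = db->CompactRange(cro, nullptr, nullptr);
```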

After the entire DB is deleted, are there any more writes to the DB?

There are writes, but my workload is extremely read heavy. The other problem is that the service started using 10 CPUs, which is the maximum allocated to it via cgroups, so it started getting throttled and that made things worse.

A few questions:

  1. would CompactOnDeletionCollector help here?
  2. we can see the number of sst files go down, so compaction did happen, right? Note that this metric comes from me querying the number of "live files"
  3. I feel like running manual compaction wouldn't work here, since the moment the db size went down after all the deletes, rocksdb started using all the CPUs, got throttled, and pretty much deadlocked?

Another data point: the last time this happened, a simple db restart fixed the issue and everything went back to normal.

cbi42 commented

a simple db restart fixed the issue and everything went back to normal

Do you hold snapshots for a very long time? A restart clears all snapshots. Tombstones can remain in the last-level files if there's a snapshot preventing them from being dropped.
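You can check this on the running process via DB properties (a sketch with the C++ API):

```cpp
#include <cinttypes>
#include <cstdio>

#include <rocksdb/db.h>

// Sketch: report how many snapshots are still alive and how old the oldest
// one is. A long-lived snapshot keeps tombstones from being dropped even
// after they reach the bottommost level.
void CheckSnapshots(rocksdb::DB* db) {
  uint64_t num_snapshots = 0;
  uint64_t oldest_snapshot_unix_ts = 0;
  db->GetIntProperty("rocksdb.num-snapshots", &num_snapshots);
  db->GetIntProperty("rocksdb.oldest-snapshot-time", &oldest_snapshot_unix_ts);
  std::printf("live snapshots: %" PRIu64 ", oldest snapshot time: %" PRIu64 "\n",
              num_snapshots, oldest_snapshot_unix_ts);
}
```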

would CompactOnDeletionCollector help here?

If the problem is that files with many tombstones are not being compacted down to the last level, then yes. It won't help if there's a snapshot keeping the tombstones alive.
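Wiring it up looks roughly like this (a sketch with the C++ API; the window size, trigger count, and ratio below are illustrative and need tuning for your workload):

```cpp
#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

// Sketch: mark an SST file for compaction when it looks tombstone-heavy.
rocksdb::Options options;
options.table_properties_collector_factories.emplace_back(
    rocksdb::NewCompactOnDeletionCollectorFactory(
        /*sliding_window_size=*/128 * 1024,  // entries per sliding window
        /*deletion_trigger=*/64 * 1024,      // deletes in one window that flag the file
        /*deletion_ratio=*/0.5));            // or >=50% of the file's entries are deletes
```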

I feel like running manual compaction wouldn't work here, since the moment the db size went down after all the deletes, rocksdb started using all the CPUs, got throttled, and pretty much deadlocked?

The hope is to compact away the tombstones with manual compaction so that iterators won't use this much CPU.

Do you hold snapshots for a very long time?

my delete workflow is as follows:

  1. create a checkpoint
  2. have some separate service/process open the checkpoint (read only mode)
  3. iterate over all the kvs; for each stale kv, send a delete request to the service that has rocksdb open in read/write mode, then move on to the next kv.
  4. delete checkpoint directory

Are checkpoint and snapshot the same thing here?

Given that, does rocksdb consider the checkpoint to be "held" the entire time?

Should I update my cleanup service to do something like this (rough sketch after the list):

  1. get an estimate of the number of kvs in the db
  2. iterate over all the kvs; for each stale kv, send a delete request to the service that has rocksdb open in read/write mode, then move on to the next kv.
  3. if the ratio of deleted keys to the estimated number of kvs reaches X, stop trying to clean up.
  4. delete checkpoint directory
  5. trigger manual compaction
  6. sleep for x minutes
  7. create checkpoint again, and start processing again
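Rough sketch of that loop (C++ for illustration; IsStale and SendDeleteToWriter are placeholders for my own logic, and in reality the deletes go over the network to the read/write service):

```cpp
#include <chrono>
#include <memory>
#include <thread>

#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Placeholders for my own logic.
bool IsStale(const rocksdb::Slice& key, const rocksdb::Slice& value);
void SendDeleteToWriter(const rocksdb::Slice& key);

void CleanupPass(rocksdb::DB* writer_db, const std::string& checkpoint_dir,
                 double max_delete_ratio, std::chrono::minutes pause) {
  // 1. get an estimate of the number of kvs in the db
  uint64_t estimated_keys = 0;
  writer_db->GetIntProperty("rocksdb.estimate-num-keys", &estimated_keys);

  // 2./3. scan the checkpoint read-only, delete stale kvs, stop at the cap
  rocksdb::DB* ro_db = nullptr;
  rocksdb::Options opts;  // must match the options the db was built with
  rocksdb::Status s = rocksdb::DB::OpenForReadOnly(opts, checkpoint_dir, &ro_db);
  if (!s.ok()) return;

  uint64_t deleted = 0;
  {
    std::unique_ptr<rocksdb::Iterator> it(
        ro_db->NewIterator(rocksdb::ReadOptions()));
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
      if (!IsStale(it->key(), it->value())) continue;
      SendDeleteToWriter(it->key());
      ++deleted;
      if (estimated_keys > 0 &&
          static_cast<double>(deleted) / estimated_keys >= max_delete_ratio) {
        break;  // 3. deleted too much of the db in one pass, stop here
      }
    }
  }  // iterator must be released before the db is closed
  delete ro_db;  // 4. checkpoint directory removal not shown

  // 5. trigger manual compaction on the read/write db to drop the tombstones
  rocksdb::CompactRangeOptions cro;
  writer_db->CompactRange(cro, nullptr, nullptr);

  // 6. back off before the next pass
  std::this_thread::sleep_for(pause);
}
```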

cbi42 commented

A checkpoint is different from a snapshot, which is what GetSnapshot() creates; a checkpoint does not keep tombstones alive in the source DB the way a snapshot does. Manual compaction should help.
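To make the difference concrete (a sketch with the C++ API):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/utilities/checkpoint.h>

// A checkpoint is a separate on-disk copy of the DB (mostly hard links to
// the SST files) that another process can open. Once created, it does not
// pin sequence numbers or keep tombstones alive in the source DB.
void MakeCheckpoint(rocksdb::DB* db, const std::string& dir) {
  rocksdb::Checkpoint* cp = nullptr;
  rocksdb::Status s = rocksdb::Checkpoint::Create(db, &cp);
  if (s.ok()) s = cp->CreateCheckpoint(dir);
  delete cp;
}

// A snapshot is an in-process handle that pins a sequence number; until it
// is released, compaction cannot drop data (including tombstones) that the
// snapshot might still need.
void UseSnapshot(rocksdb::DB* db) {
  const rocksdb::Snapshot* snap = db->GetSnapshot();
  rocksdb::ReadOptions ro;
  ro.snapshot = snap;
  // ... reads at a consistent point in time ...
  db->ReleaseSnapshot(snap);
}
```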

To figure out why ttl compaction did not run, you can dump the SST files with ./sst_dump --command=raw --show_properties --file=/... and check for this table property:

// Oldest ancester time. 0 means unknown.

Manual compaction should help.

got it, thank you!