rocksdb using all allocated cpus due to contention on block cache
graphs: (screenshots attached)
flamegraph: (attached)
workload:
heavy prefix lookups (thousands per second) to check if a key prefix exists in the db
writes at a much lower rate, around 200 RPS
Db size on disk: less than 2GB
rocksdb settings:
using prefix extractors + auto hyper clock cache + running rocksdb 9.7.4.
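Roughly, the setup looks like this (a simplified sketch using the rust-rocksdb bindings; the prefix length shown is illustrative, not my exact config):

```rust
use rocksdb::{BlockBasedOptions, Cache, DB, Options, SliceTransform};

fn open_db(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);

    // Fixed-length prefix extractor (the 8-byte length is illustrative).
    opts.set_prefix_extractor(SliceTransform::create_fixed_prefix(8));

    // 256MB hyper clock cache. Passing 0 for estimated_entry_charge lets
    // RocksDB size entries automatically, i.e. the "auto" hyper clock cache.
    let cache = Cache::new_hyper_clock_cache(256 * 1024 * 1024, 0);
    let mut table_opts = BlockBasedOptions::default();
    table_opts.set_block_cache(&cache);
    opts.set_block_based_table_factory(&table_opts);

    DB::open(&opts, path)
}
```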
This is an extension of #13081, where I saw the same issue and blamed it on the LRU cache, so I switched to the auto hyper clock cache and ran some tests that seemed not to repro the issue, but that doesn't appear to be the case here.
It is very possible that many lookups are using the same prefix/looking up the same key. Would this cause contention for hyper clock cache? Is there something that I can tweak/tune? Maybe the "auto" hyper clock cache is the problem and I need to manually tweak some things?
I was watching https://www.youtube.com/watch?v=Tp9jO5rt7HU and it seems I may be running into this case:
That video is from a year ago, so I'm not sure if things have changed since then.
In my case, my block cache is set to 256MB, which I assume is considered "small". The db size is around 1.5GB, but that's compressed and my kvs compress very well; uncompressed it would be much larger (there are 10M+ kvs in the db).
a few questions:
- is there a metric we can track that shows whether the hyper clock cache is hitting this case of searching for things to evict? It's not showing up in the flamegraph, but maybe it's not meant to show up?
- if it is indeed this case (or assuming it is), what is the solution/workaround?
@pdillinger what are your thoughts on this?
I don't see any significant block cache indicators on the flame graph. I can't zoom into the names that are cut short. This looks more like a workload of excessive skipped internal keys (e.g. skipping over tombstones to find something). Are you using `prefix_same_as_start` or an `iterate_upper_bound` for your prefix queries? You don't want to be scanning to the next non-empty prefix just to discover the prefix you are interested in is empty.
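Something along these lines (a rough sketch with the Rust bindings you appear to be using; deriving the upper bound by incrementing the last byte of the prefix is just illustrative):

```rust
use rocksdb::{DB, ReadOptions};

/// Check whether any key with the given prefix exists, without scanning
/// past the prefix. The upper bound is the prefix with its last byte
/// incremented (illustrative; real code must handle 0xff bytes).
fn prefix_exists(db: &DB, prefix: &[u8]) -> bool {
    let mut upper_bound = prefix.to_vec();
    *upper_bound.last_mut().expect("non-empty prefix") += 1;

    let mut read_opts = ReadOptions::default();
    read_opts.set_prefix_same_as_start(true);      // stay within the seek prefix
    read_opts.set_iterate_upper_bound(upper_bound); // hard stop at the end of the prefix

    let mut iter = db.raw_iterator_opt(read_opts);
    iter.seek(prefix);
    iter.valid() // true iff at least one key with this prefix exists
}
```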
What makes you think block cache?
> Are you using `prefix_same_as_start`
yes.
> What makes you think block cache?
I'm actually confused right now. The initial problem started with #13120, where something outside of my control deletes all the files on disk that are being indexed in rocksdb. Then a clean up service that is meant to remove orphaned indexes runs and deletes pretty much the entire db, since it notices the indexes don't point to anything on disk.
This typically happens when not all services running on the disk are online, in the sense that nothing is writing to rocksdb but there are many reads (the prefix lookups). However, sometimes the rest of the services are enabled but rocksdb has reached a bad state and continues like that for a while. I also learned that the TTL compaction I have won't trigger automatically; it relies on flushes/writes to trigger the compaction. So the advice from the other GitHub issue I linked is to have my clean up service issue manual compactions after it's done removing all the keys, to avoid the case we are hitting here.
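The manual compaction the clean up service now issues is just a full-range compact_range; roughly this (a minimal sketch with the Rust bindings):

```rust
use rocksdb::DB;

/// After the clean up service finishes deleting orphaned indexes, compact
/// the entire key range so the deletion tombstones get dropped instead of
/// waiting for flush-driven automatic/TTL compactions.
fn compact_after_cleanup(db: &DB) {
    // None..None covers the whole key space.
    db.compact_range(None::<&[u8]>, None::<&[u8]>);
}
```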
for example, here is a recent occurrence:
- something misbehaves and deletes everything on disk
- clean up service runs to delete orphaned indexes
- after each run, clean up service issues a manual compaction
- here are other graphs showing different stats of rocksdb at the time
Based on the graphs, you can see we don't have any accumulation of tombstones during the time we had the CPU spike. The only thing that spikes with the CPU is block cache related metrics, and that's the only reason I'm suspecting the block cache, even though from the flamegraph itself it looks like it should be tombstones. Also, as I said in the other linked ticket, sometimes waiting a few hours/days will fix it; other times I give up on it recovering and restart the service/rocksdb, which also fixes it.
As for how I'm getting the number of tombstones, etc., I call https://github.com/zaidoon1/rust-rocksdb/blob/f22014c5f102744c8420d26d6ded90f340fb909c/src/db.rs#L2326-L2327.
tombstones = num_deletions, live keys = num_entries, and I sum those across all live files. I assume that's accurate.
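In code terms, roughly this (a sketch of that computation using the live_files() binding; not my exact code):

```rust
use rocksdb::DB;

/// Sum per-file metadata across all live SST files:
/// tombstones = num_deletions, "live keys" = num_entries.
fn tombstone_stats(db: &DB) -> Result<(u64, u64), rocksdb::Error> {
    let mut tombstones = 0u64;
    let mut keys = 0u64;
    for f in db.live_files()? {
        tombstones += f.num_deletions;
        keys += f.num_entries;
    }
    Ok((tombstones, keys))
}
```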