basho/bitcask

A bitcask file can get merged repeatedly [JIRA: RIAK-1844]


A bitcask file is not deleted immediately after a merge, but handed off to bitcask_merge_delete for a deferred delete. In the meantime it is marked by turning on its setuid bit.
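For reference, the setuid-bit marking can be done with plain `file` module calls; here is a minimal sketch (the helper names `mark_for_delete/1` and `marked_for_delete/1` are illustrative, not bitcask's actual internals):

```erlang
-module(mark_sketch).
-export([mark_for_delete/1, marked_for_delete/1]).

-include_lib("kernel/include/file.hrl").

-define(SETUID, 8#4000).

%% Turn on the setuid bit of a data file to mark it for deferred delete.
mark_for_delete(File) ->
    {ok, FI} = file:read_file_info(File),
    file:write_file_info(File, FI#file_info{mode = FI#file_info.mode bor ?SETUID}).

%% Check whether a data file is already marked for deferred delete.
marked_for_delete(File) ->
    {ok, FI} = file:read_file_info(File),
    FI#file_info.mode band ?SETUID =/= 0.
```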

Calling bitcask:merge/1 merges readable files, and that list does not include files marked for deletion.

However, riak_kv_bitcask_backend starts a merge by specifying the exact list of files to merge, and that list comes from bitcask:needs_merge/1, which does not filter out files marked for deletion.

This means that if a file is not deleted within 3 minutes of being merged, it will be included in the next merge as well. This shouldn't happen too often on its own, but it is magnified when all merging is deferred until the merge window, at which point a huge amount of merge activity can suddenly begin. Also, bitcask_merge_delete is a per-node (not per-vnode) process, so a long-running fold operation in any of the vnodes (e.g. MDC replication) may block the deferred-delete queue for a long time.

The result is unnecessary disk I/O and CPU usage, but there is no other risk, e.g. no chance of data loss.

This problem was noticed on Riak 1.4.8, and although a lot has changed in 1.4.12, it looks to me like the issue may still be present.

I am not sure whether Riak 2.0.x versions are affected or not.

Thanks for the report and the analysis @dszoboszlay. I'm the engineer who looked at your assessment in the Zendesk ticket and I agree with it. Bitcask in 2.0.1+ has more knobs to limit the number of disk bytes merged, which would avoid massive merge spikes and make this less of a problem in many cases. We should have a fix soon that prevents the merge process from merging a file already marked for deletion and that filters those files out early when needs_merge runs its heuristic to decide which files to put in the merge queue.
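A fix along those lines could drop marked files from the candidate list before the merge heuristic even considers them. A hedged sketch over `{Filename, Mode}` pairs (the function name and shape are mine, not the actual patch):

```erlang
-module(filter_sketch).
-export([unmarked_files/1]).

-define(SETUID, 8#4000).

%% Given {Filename, Mode} pairs, keep only files whose setuid bit is
%% clear, i.e. files that have not been handed to the deferred-delete
%% queue yet. Files with the bit set will be unlinked soon anyway, so
%% merging them again is wasted work.
unmarked_files(FilesWithModes) ->
    [F || {F, Mode} <- FilesWithModes, Mode band ?SETUID =:= 0].
```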

You may also consider speeding up the delete process by tweaking bitcask_merge_delete a bit. If the first delete request cannot be performed right now, it may still be worth trying the rest of the requests in the queue. This way a slow fold in one partition wouldn't block deletes in other partitions.
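The idea above can be sketched as a sweep over the whole pending list instead of blocking on its head; everything here (function names, the fun parameters) is hypothetical, not bitcask_merge_delete's actual API:

```erlang
-module(queue_sketch).
-export([try_pending/3]).

%% Attempt every pending delete request and return the ones that must
%% still wait (e.g. because a fold in that partition holds a reference
%% to the file). CanDeleteFun decides if a request is unblocked;
%% DeleteFun performs the actual delete.
try_pending(Pending, CanDeleteFun, DeleteFun) ->
    lists:filter(
      fun(Req) ->
              case CanDeleteFun(Req) of
                  true  -> DeleteFun(Req), false;  % deleted, drop from queue
                  false -> true                    % still blocked, keep
              end
      end, Pending).
```

With this shape, one blocked partition only keeps its own requests in the queue; requests for other partitions are retried on every sweep.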

There is one delete process per partition; they don't interfere with each other, @dszoboszlay.

Are you sure? The bitcask_merge_delete servers do register themselves locally, so there can be only one such server running per Erlang node.

Yes, you are right @dszoboszlay, I hadn't noticed. That is indeed suboptimal.

This behavior has also been seen in the following Zendesk ticket: https://basho.zendesk.com/agent/tickets/9308

@engelsanchez FYI seeing this on another ticket (1.4.12 installation) https://basho.zendesk.com/agent/tickets/11440