basho/bitcask

Race makes opening Bitcask dir impossible

Closed this issue · 2 comments

I have reproduced a problem where Bitcask gets stuck, unable to re-open a cask in a certain directory.

On open, a keydir object is first created but not marked as ready. The files are then scanned to populate it, after which it is marked ready at bitcask.erl#L1248 and things are good. However, if the scan errors out, we hit the branch at bitcask.erl#L1244 instead, which does not mark the keydir as ready and leaves it behind in that state. Calling open again on the same directory finds the existing keydir, detects that it is not ready, and waits for it to load at bitcask.erl#L1252, eventually timing out.

When the error happens, the newly created keydir should probably be released.
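To illustrate the proposed behaviour, here is a minimal sketch of "release on failed scan". It is not Bitcask's code: the module name, the `keydirs` ETS table standing in for the shared keydir registry, and the `ScanFun` argument are all hypothetical, chosen only to make the example self-contained.

```erlang
-module(keydir_open_sketch).
-export([open/2, demo/0]).

%% Hypothetical registry: an ETS table mapping Dirname -> ready | not_ready.
%% It stands in for the shared keydir state; it is NOT Bitcask's NIF API.
ensure_table() ->
    case ets:info(keydirs) of
        undefined -> ets:new(keydirs, [named_table, public, set]);
        _ -> keydirs
    end.

%% open(Dirname, ScanFun): create the entry, run the scan, mark it ready.
%% If the scan raises, the entry is deleted before returning, so a later
%% open/2 starts from scratch instead of finding a not_ready leftover.
open(Dirname, ScanFun) ->
    Tab = ensure_table(),
    case ets:lookup(Tab, Dirname) of
        [{_, ready}] ->
            {ok, already_open};
        [{_, not_ready}] ->
            %% The code described in this issue waits here and times out.
            {error, not_ready};
        [] ->
            true = ets:insert(Tab, {Dirname, not_ready}),
            try ScanFun(Dirname) of
                _ ->
                    true = ets:insert(Tab, {Dirname, ready}),
                    {ok, opened}
            catch
                Class:Reason ->
                    %% Release the half-built entry on any scan failure.
                    true = ets:delete(Tab, Dirname),
                    {error, {scan_failed, Class, Reason}}
            end
    end.

%% A failed scan followed by a successful re-open of the same directory.
demo() ->
    Bad = fun(_) -> error(einval) end,
    Good = fun(_) -> ok end,
    {error, {scan_failed, error, einval}} = open("cask", Bad),
    {ok, opened} = open("cask", Good),
    ok.
```

Without the `ets:delete/2` call in the `catch` clause, the second `open/2` in `demo/0` would hit the `not_ready` branch, which is the stuck state described above.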

Now, the fact that the error happens during the scan may point to a separate bug. What has been observed is that the function that lists files in a directory, bitcask_fileops:list_dir/1, returns {error, einval}. This is not handled in bitcask_fileops:data_file_tstamps/1, which causes the error that leads to the stuck keydir. Note that list_dir/1 tries to avoid a call to the file server by calling the efile port directly, which might be part of the reason. I'm currently investigating the exact sequence of events that leads to this.
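As a rough sketch of what handling the listing error could look like, the hypothetical module below returns `{error, Reason}` to the caller instead of crashing on an unmatched result. It uses the standard `file:list_dir/1` as a stand-in for the direct port call, and it assumes data files are named `<integer>.bitcask.data`; the return shapes are illustrative and may not match the real `data_file_tstamps/1`.

```erlang
-module(tstamps_sketch).
-export([data_file_tstamps/1, data_file_tstamps/2]).

%% Hypothetical variant that reports a failed directory listing as
%% {error, ...} rather than failing on an unhandled {error, einval}.
data_file_tstamps(Dirname) ->
    data_file_tstamps(Dirname, fun file:list_dir/1).

%% ListDir is injectable so the failure path can be exercised directly.
data_file_tstamps(Dirname, ListDir) ->
    case ListDir(Dirname) of
        {ok, Files} ->
            %% Keep only names that parse as data files, sorted by timestamp.
            {ok, lists:sort([{T, filename:join(Dirname, F)}
                             || F <- Files, {ok, T} <- [tstamp(F)]])};
        {error, Reason} ->
            {error, {list_dir_failed, Dirname, Reason}}
    end.

%% Assumption: data files are named "<integer>.bitcask.data".
tstamp(Filename) ->
    case string:split(Filename, ".") of
        [Base, "bitcask.data"] ->
            try {ok, list_to_integer(Base)}
            catch error:badarg -> skip
            end;
        _ ->
            skip
    end.
```

With an explicit error tuple, the caller can decide whether to retry or abort the open, and in either case can clean up the keydir before returning.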

The issue that caused the scan to fail has been filed separately here: #188. We should fix both sides: scans shouldn't fail, but any failure shouldn't result in an unusable keydir.

Fixed in the 1.7 branch by #190. A separate issue will track merging all fixes for the Riak 2.0.1 release into the develop branch.