basho/bitcask

Delete found in fold operation (counterexample Cp8)

Opened this issue · 4 comments

Here's counterexample Cp8:

Cp8 = [[{set,{var,1},{call,bitcask_pulse,bc_open,[true,{true,{62,59,175},26}]}},{set,{var,2},{call,bitcask_pulse,puts,[{var,1},{2,3},<<188,81,240,255,122,185,227,8,104,196,56,142,250,34,8>>]}},{set,{var,6},{call,bitcask_pulse,bc_close,[{var,1}]}},{set,{var,10},{call,bitcask_pulse,fork,[[{init,{state,undefined,false,false,[]}},{set,{not_var,1},{not_call,bitcask_pulse,bc_open,[false,{true,{282,224,221},100}]}},{set,{not_var,2},{not_call,bitcask_pulse,fold,[{not_var,1}]}},{set,{not_var,3},{not_call,bitcask_pulse,fold,[{not_var,1}]}}]]}},{set,{var,12},{call,bitcask_pulse,bc_open,[true,{true,{270,229,202},55}]}},{set,{var,18},{call,bitcask_pulse,delete,[{var,12},2]}},{set,{var,19},{call,bitcask_pulse,merge,[{var,12}]}},{set,{var,20},{call,bitcask_pulse,fork_merge,[{var,12}]}},{set,{var,23},{call,bitcask_pulse,puts,[{var,12},{3,43},<<0,0,0,0>>]}},{set,{var,25},{call,bitcask_pulse,fold,[{var,12}]}},{set,{var,26},{call,bitcask_pulse,bc_close,[{var,12}]}},{set,{var,29},{call,bitcask_pulse,bc_open,[true,{true,{246,190,50},100}]}},{set,{var,35},{call,bitcask_pulse,delete,[{var,29},5]}},{set,{var,36},{call,bitcask_pulse,fold,[{var,29}]}}],{43245,66780,7696},[{events,[]}]].
[begin
     Now = {1405,406317,535184},
     io:format("Now ~w ", [Now]),
     true = eqc:check(bitcask_pulse:prop_pulse(),
                      [lists:nth(1, Cp8), Now, lists:nth(3, Cp8)])
 end || _ <- lists:seq(1, 100)].

That seed of {1405,406317,535184} is nicely deterministic on my Mac. YMMV; substitute now() in its place and run until you find a seed that fails very consistently.
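
For example, a variant of the loop above along those lines (a rough sketch: it just swaps the fixed seed for a fresh now() per iteration and drops the true = match so a failing seed doesn't abort the loop; note which printed seeds don't come back true):

    [begin
         Now = now(),
         io:format("Now ~w -> ", [Now]),
         Res = eqc:check(bitcask_pulse:prop_pulse(),
                         [lists:nth(1, Cp8), Now, lists:nth(3, Cp8)]),
         io:format("~w~n", [Res])
     end || _ <- lists:seq(1, 100)].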

The last three ops of the main thread are:

  {set,{var,29},
       {call,bitcask_pulse,bc_open,[true,{true,{246,190,50},100}]}},
  {set,{var,35},{call,bitcask_pulse,delete,[{var,29},5]}},
  {set,{var,36},{call,bitcask_pulse,fold,[{var,29}]}}],

The failure is that key 5 should not be found by step 36's fold, yet it is.

Bad:
[{0,410681,[]},
 {410681,infinity,
  [{bad,<<"pid_1">>,
        {fold,[{2,[not_found]},
               {3,[<<0,0,0,0>>]},
               {4,[<<0,0,0,0>>]},
               {5,[not_found]},
[...]
               {43,[<<0,0,0,0>>]}],
              [{3,<<0,0,0,0>>},
               {4,<<0,0,0,0>>},
               {5,<<0,0,0,0>>},
[...]

More research required. There's a possible race with a forked merge and/or with merge at open time via bitcask:make_merge_file().

This isn't a merge race; it's a problem with how the model checks folds when the keydir has truly frozen. The sequence (sketched in code after the list below) is:

  1. Fork a fold that runs for a while.
  2. Do a lot of puts so that the keydir is frozen at epoch 47.
  3. Do some more puts while folding a couple more times. The epoch counter creeps up to 50.
  4. Do a delete. The thing that we're deleting is inside file 5. The keydir_remove deletes the key in the keydir at epoch 50.
  5. A new fold starts (by the same pid that did the delete in the previous step, FWIW), and bitcask:open_fold_files() starts executing. The keydir epoch is now 51, but because we're really frozen, the keydir's pending_start_epoch is 47.
  6. After all fold files are open, yes indeed, we're using a folding epoch of 47.
  7. When the fold reaches file 5, we query the keydir for our key, and because the delete at epoch 51 isn't visible at epoch 47, the keydir says that file 5's entry is current. Thus the key appears in the fold results of pid P even though P deleted that key just moments earlier.
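
At the public bitcask API level, the interleaving looks roughly like this (a sketch only: the directory, keys, and the slow forked fold are made up, and whether the keydir actually freezes and exposes the window depends on timing, which is why PULSE is needed to hit it reliably):

    Dir = "/tmp/bc_frozen_fold_demo",          %% hypothetical scratch directory
    Ref = bitcask:open(Dir, [read_write]),
    %% Seed some data so the forked fold below has work to do.
    [ok = bitcask:put(Ref, <<K:32>>, <<0,0,0,0>>) || K <- lists:seq(1, 10)],
    %% Step 1: fork a slow fold; while it is in flight the keydir may freeze
    %% so that the fold sees a consistent snapshot.
    _Folder = spawn(fun() ->
                        FRef = bitcask:open(Dir),
                        bitcask:fold(FRef,
                                     fun(_K, _V, Acc) -> timer:sleep(100), Acc end,
                                     ok),
                        bitcask:close(FRef)
                    end),
    %% Steps 2-3: more puts while that fold runs push the epoch counter past
    %% the frozen pending_start_epoch.
    [ok = bitcask:put(Ref, <<K:32>>, <<0,0,0,0>>) || K <- lists:seq(11, 43)],
    %% Step 4: delete key 5; the removal is tagged with the current epoch.
    ok = bitcask:delete(Ref, <<5:32>>),
    %% Steps 5-7: a new fold by the same pid can snapshot at the older frozen
    %% epoch, so the delete is not yet visible and key 5 may still show up.
    Keys = bitcask:fold(Ref, fun(K, _V, Acc) -> [K | Acc] end, []),
    lists:member(<<5:32>>, Keys).    %% true here is the surprising result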

Good to hear it's a test model issue. I'm actually surprised we haven't seen more of these now that the keydir can freeze and cap the visibility of folds at hard-to-predict times without the EQC model accounting for it. shrug

I used to have such a check and would disregard results that happened during a known-frozen test case. sigh I haven't decided if I want to resurrect it or not. The model already knows that multiple folds are happening, which is a prerequisite for a frozen keydir.
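
For what it's worth, the kind of guard described above could look something like this (a hypothetical sketch only; check_fold/3 and the folds-in-flight count are made-up names, not the actual bitcask_pulse model):

    %% Hypothetical sketch, not the real bitcask_pulse code: skip judging a
    %% fold's result when the model knows the keydir may be frozen (more than
    %% one fold in flight), instead of flagging it as {bad, ...}.
    -module(fold_check_sketch).
    -export([check_fold/3]).

    %% FoldsInFlight: how many folds the model believes are running.
    check_fold(FoldsInFlight, _Expected, _Actual) when FoldsInFlight > 1 ->
        ok;                                     %% visibility capped; disregard
    check_fold(_FoldsInFlight, Expected, Actual) ->
        case Actual =:= Expected of
            true  -> ok;
            false -> {bad, {fold, Expected, Actual}}
        end.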

Ping ... milestone and code review are still pending.