Assertion failure in cluster.py
1-p opened this issue · 4 comments
If you run python run.py 903, you may encounter an assertion failure:
AssertionError: next slot to commit is already decided
@MichaelDiBernardo the output is here https://gist.github.com/1-p/01ff5fe68e81c7d11bcf
(I added some code to print the internal state at the end when assertion failed.)
Somehow do_Decision is called with slot = 27 when decisions contains 0-29.
slot is the next slot that the replica wants to commit, but does not yet have a decision for. The while loop in do_Decision should take care of ensuring this invariant, by looping until self.decisions.get(self.slot) is false. In developing the code, I've seen this happen when decisions dictionaries are aliased between components -- that is, when two different nodes are using the same Python dictionary. Then one node adds key 27 to the dictionary and increments its own slot, and another component comes along and is surprised to see 27 in the dictionary.
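A minimal sketch of that aliasing failure, assuming a toy ToyReplica class that is a hypothetical stand-in and not the Replica from cluster.py. Only the invariant check and the while loop are modeled on the description above:

```python
class ToyReplica:
    """Hypothetical stand-in for a replica; not the code from cluster.py."""

    def __init__(self, decisions, slot=1):
        self.decisions = decisions  # stores the reference, no copy is made
        self.slot = slot

    def do_decision(self, slot, proposal):
        # Invariant: the next slot to commit must not have a decision yet.
        if self.decisions.get(self.slot):
            raise AssertionError("next slot to commit is already decided")
        self.decisions[slot] = proposal
        # Advance past every consecutively decided slot.
        while self.decisions.get(self.slot):
            self.slot += 1


# Two replicas accidentally sharing one dictionary.
shared = {}
a = ToyReplica(shared)
b = ToyReplica(shared)

a.do_decision(1, "p1")  # a adds key 1 and moves its own slot to 2

try:
    b.do_decision(1, "p1")  # b still has slot == 1, but key 1 is already there
    aliased_failed = False
except AssertionError:
    aliased_failed = True

# Two replicas with separate dictionaries handle the same decision fine.
c = ToyReplica({})
d = ToyReplica({})
c.do_decision(1, "p1")
d.do_decision(1, "p1")
```

Here b trips the invariant even though it has done nothing wrong itself, because a's write to the shared dictionary is visible to it before its own slot has advanced.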
So I see N2 welcome N0 in a state where slots 1-26 are decided, and slot is 27. Everything still makes sense here:
N2 - T=1002.835 sending Welcome(state={'a': 20, 'c': 20, 'b': 20, 'e': 20, 'd': 30, 'g': 20, 'f': 20}, slot=27, decisions={1: Proposal(caller='N6', client_id=100000, input=('get', 'd')), 2: Proposal(caller='N6', client_id=100003, input=('get', 'g')), 3: Proposal(caller='N6', client_id=100005, input=('get', 'f')), 4: Proposal(caller='N6', client_id=100004, input=('get', 'e')), 5: Proposal(caller='N6', client_id=100002, input=('get', 'b')), 6: Proposal(caller='N6', client_id=100006, input=('get', 'a')), 7: Proposal(caller='N6', client_id=100001, input=('get', 'c')), 8: Proposal(caller='N6', client_id=100007, input=('set', 'd', 10)), 9: Proposal(caller='N6', client_id=100008, input=('set', 'g', 10)), 10: Proposal(caller='N6', client_id=100009, input=('set', 'f', 10)), 11: Proposal(caller='N6', client_id=100012, input=('set', 'a', 10)), 12: Proposal(caller='N6', client_id=100013, input=('set', 'c', 10)), 13: Proposal(caller='N6', client_id=100010, input=('set', 'e', 10)), 14: Proposal(caller='N6', client_id=100011, input=('set', 'b', 10)), 15: Proposal(caller='N6', client_id=100014, input=('get', 'd')), 16: Proposal(caller='N6', client_id=100015, input=('get', 'g')), 17: Proposal(caller='N6', client_id=100016, input=('get', 'f')), 18: Proposal(caller='N6', client_id=100017, input=('get', 'a')), 19: Proposal(caller='N6', client_id=100018, input=('get', 'c')), 20: Proposal(caller='N6', client_id=100019, input=('get', 'e')), 21: Proposal(caller='N6', client_id=100020, input=('get', 'b')), 22: Proposal(caller='N6', client_id=100021, input=('set', 'd', 20)), 23: Proposal(caller='N6', client_id=100022, input=('set', 'g', 20)), 24: Proposal(caller='N6', client_id=100023, input=('set', 'f', 20)), 25: Proposal(caller='N6', client_id=100024, input=('set', 'a', 20)), 26: Proposal(caller='N6', client_id=100025, input=('set', 'c', 20)), 28: Proposal(caller='N6', client_id=100027, input=('set', 'b', 20)), 29: Proposal(caller='N6', client_id=100028, input=('set', 'd', 30))}) to ['N0']
It then gets a decision for slot 27 and commits slots 27-29:
N2.Replica - T=1002.836 received Decision(slot=27, proposal=Proposal(caller='N6', client_id=100026, input=('set', 'e', 20))) from N6
N2.Replica - T=1002.836 committing Proposal(caller='N6', client_id=100026, input=('set', 'e', 20)) at slot 27
N2.Replica - T=1002.836 committing Proposal(caller='N6', client_id=100027, input=('set', 'b', 20)) at slot 28
N2.Replica - T=1002.836 committing Proposal(caller='N6', client_id=100028, input=('set', 'd', 30)) at slot 29
Before N0 receives the Welcome, which now has slot 27 filled:
N0.Bootstrap - T=1002.860 received Welcome(state={'a': 30, 'c': 30, 'b': 20, 'e': 20, 'd': 30, 'g': 30, 'f': 30}, slot=27, decisions={1: Proposal(caller='N6', client_id=100000, input=('get', 'd')), 2: Proposal(caller='N6', client_id=100003, input=('get', 'g')), 3: Proposal(caller='N6', client_id=100005, input=('get', 'f')), 4: Proposal(caller='N6', client_id=100004, input=('get', 'e')), 5: Proposal(caller='N6', client_id=100002, input=('get', 'b')), 6: Proposal(caller='N6', client_id=100006, input=('get', 'a')), 7: Proposal(caller='N6', client_id=100001, input=('get', 'c')), 8: Proposal(caller='N6', client_id=100007, input=('set', 'd', 10)), 9: Proposal(caller='N6', client_id=100008, input=('set', 'g', 10)), 10: Proposal(caller='N6', client_id=100009, input=('set', 'f', 10)), 11: Proposal(caller='N6', client_id=100012, input=('set', 'a', 10)), 12: Proposal(caller='N6', client_id=100013, input=('set', 'c', 10)), 13: Proposal(caller='N6', client_id=100010, input=('set', 'e', 10)), 14: Proposal(caller='N6', client_id=100011, input=('set', 'b', 10)), 15: Proposal(caller='N6', client_id=100014, input=('get', 'd')), 16: Proposal(caller='N6', client_id=100015, input=('get', 'g')), 17: Proposal(caller='N6', client_id=100016, input=('get', 'f')), 18: Proposal(caller='N6', client_id=100017, input=('get', 'a')), 19: Proposal(caller='N6', client_id=100018, input=('get', 'c')), 20: Proposal(caller='N6', client_id=100019, input=('get', 'e')), 21: Proposal(caller='N6', client_id=100020, input=('get', 'b')), 22: Proposal(caller='N6', client_id=100021, input=('set', 'd', 20)), 23: Proposal(caller='N6', client_id=100022, input=('set', 'g', 20)), 24: Proposal(caller='N6', client_id=100023, input=('set', 'f', 20)), 25: Proposal(caller='N6', client_id=100024, input=('set', 'a', 20)), 26: Proposal(caller='N6', client_id=100025, input=('set', 'c', 20)), 27: Proposal(caller='N6', client_id=100026, input=('set', 'e', 20)), 28: Proposal(caller='N6', client_id=100027, input=('set', 'b', 20)), 29: Proposal(caller='N6', client_id=100028, input=('set', 'd', 30))}) from N2
The issue is that I've de-aliased things on receipt rather than on transmission. This particular bug could be fixed by copying self.decisions in do_Join (that is, on transmission), but quite likely there are other examples of this issue.
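To illustrate why copying on receipt is too late, here is a rough sketch using a hypothetical ToyNetwork that queues messages in-process and delivers them later. The names ToyNetwork, run, and copy_on_send are made up for this example and do not appear in cluster.py, and the string proposals are placeholders:

```python
from collections import deque


class ToyNetwork:
    """Hypothetical in-process network: send() queues, deliver() hands over later."""

    def __init__(self):
        self.queue = deque()

    def send(self, message):
        self.queue.append(message)

    def deliver(self):
        return self.queue.popleft()


def run(copy_on_send):
    # Sender state when the Welcome is sent: slot 27 is still undecided.
    decisions = {26: "p26", 28: "p28", 29: "p29"}
    slot = 27

    network = ToyNetwork()
    # Copy on transmission, or hand over the live dictionary.
    payload = dict(decisions) if copy_on_send else decisions
    network.send({"slot": slot, "decisions": payload})

    # Before delivery, the sender learns the decision for slot 27.
    decisions[27] = "p27"

    message = network.deliver()
    # Copy on receipt: this runs after the mutation, so it cannot undo it.
    received = dict(message["decisions"])
    return message["slot"], received


aliased_slot, aliased_decisions = run(copy_on_send=False)
copied_slot, copied_decisions = run(copy_on_send=True)
```

Without the copy on send, the receiver ends up with slot 27 and a decisions dictionary that already contains key 27, which matches the Welcome seen by N0 above. A shallow dict() copy is only enough if the stored proposals are themselves never mutated, which is an assumption here.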
Should I make a PR to try to address this more generically?
Thanks @djmitche!