aosabook/500lines

Assertion failure in cluster.py

1-p opened this issue · 4 comments

1-p commented

If you run python run.py 903, you may encounter an assertion failure:

AssertionError: next slot to commit is already decided

@djmitche, are you available to comment on this? I can do it eventually, but it's going to be closer to the end of March.

Thanks for the report @1-p.

1-p commented

@MichaelDiBernardo the output is here https://gist.github.com/1-p/01ff5fe68e81c7d11bcf

(I added some code to print the internal state at the end, when the assertion failed.)

Somehow do_Decision is called with slot = 27 when decisions contains 0-29.

slot is the next slot that the replica wants to commit, but does not yet have a decision for. The while loop in do_Decision should take care of ensuring this invariant, by looping until self.decisions.get(self.slot) is false. In developing the code, I've seen this happen when decisions dictionaries are aliased between components -- that is, when two different nodes are using the same Python dictionary. Then one node adds key 27 to the dictionary and increments its own slot, and another component comes along and is surprised to see 27 in the dictionary.
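To illustrate the aliasing described above, here is a minimal sketch. It assumes a hypothetical ToyReplica class and made-up proposal strings; it is not the Replica from cluster.py, only a small model of two components holding the same dictionary.

```python
class ToyReplica:
    """Hypothetical stand-in for a replica; not the class from cluster.py."""

    def __init__(self, name, decisions, slot):
        self.name = name
        self.decisions = decisions  # stores the caller's dict itself, no copy
        self.slot = slot

    def decide(self, slot, proposal):
        self.decisions[slot] = proposal
        # Advance past every slot that now has a decision.
        while self.decisions.get(self.slot) is not None:
            self.slot += 1


shared = {26: "set c 20", 28: "set b 20"}
n2 = ToyReplica("N2", shared, slot=27)
n0 = ToyReplica("N0", shared, slot=27)  # aliased: same dict object as n2

n2.decide(27, "set e 20")

aliased = n0.decisions is n2.decisions
# N0 never handled a decision for slot 27, yet finds it already present.
n0_surprised = n0.slot in n0.decisions
```

In this toy model n2 advances its own slot, while n0 still expects its next slot to be undecided and finds it filled, which is the kind of state the assertion guards against.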

So I see N2 welcome N0 in a state where slots 1-26 are decided (28 and 29 too, but not 27), and slot is 27. Everything still makes sense here:

N2 - T=1002.835 sending Welcome(state={'a': 20, 'c': 20, 'b': 20, 'e': 20, 'd': 30, 'g': 20, 'f': 20}, slot=27, decisions={1: Proposal(caller='N6', client_id=100000, input=('get', 'd')), 2: Proposal(caller='N6', client_id=100003, input=('get', 'g')), 3: Proposal(caller='N6', client_id=100005, input=('get', 'f')), 4: Proposal(caller='N6', client_id=100004, input=('get', 'e')), 5: Proposal(caller='N6', client_id=100002, input=('get', 'b')), 6: Proposal(caller='N6', client_id=100006, input=('get', 'a')), 7: Proposal(caller='N6', client_id=100001, input=('get', 'c')), 8: Proposal(caller='N6', client_id=100007, input=('set', 'd', 10)), 9: Proposal(caller='N6', client_id=100008, input=('set', 'g', 10)), 10: Proposal(caller='N6', client_id=100009, input=('set', 'f', 10)), 11: Proposal(caller='N6', client_id=100012, input=('set', 'a', 10)), 12: Proposal(caller='N6', client_id=100013, input=('set', 'c', 10)), 13: Proposal(caller='N6', client_id=100010, input=('set', 'e', 10)), 14: Proposal(caller='N6', client_id=100011, input=('set', 'b', 10)), 15: Proposal(caller='N6', client_id=100014, input=('get', 'd')), 16: Proposal(caller='N6', client_id=100015, input=('get', 'g')), 17: Proposal(caller='N6', client_id=100016, input=('get', 'f')), 18: Proposal(caller='N6', client_id=100017, input=('get', 'a')), 19: Proposal(caller='N6', client_id=100018, input=('get', 'c')), 20: Proposal(caller='N6', client_id=100019, input=('get', 'e')), 21: Proposal(caller='N6', client_id=100020, input=('get', 'b')), 22: Proposal(caller='N6', client_id=100021, input=('set', 'd', 20)), 23: Proposal(caller='N6', client_id=100022, input=('set', 'g', 20)), 24: Proposal(caller='N6', client_id=100023, input=('set', 'f', 20)), 25: Proposal(caller='N6', client_id=100024, input=('set', 'a', 20)), 26: Proposal(caller='N6', client_id=100025, input=('set', 'c', 20)), 28: Proposal(caller='N6', client_id=100027, input=('set', 'b', 20)), 29: Proposal(caller='N6', client_id=100028, input=('set', 'd', 30))}) to ['N0']

It then gets a decision for slot 27 and commits slots 27-29:

N2.Replica - T=1002.836 received Decision(slot=27, proposal=Proposal(caller='N6', client_id=100026, input=('set', 'e', 20))) from N6
N2.Replica - T=1002.836 committing Proposal(caller='N6', client_id=100026, input=('set', 'e', 20)) at slot 27
N2.Replica - T=1002.836 committing Proposal(caller='N6', client_id=100027, input=('set', 'b', 20)) at slot 28
N2.Replica - T=1002.836 committing Proposal(caller='N6', client_id=100028, input=('set', 'd', 30)) at slot 29

All of this happens before N0 receives the Welcome, whose decisions now include slot 27 even though the message still says slot=27:

N0.Bootstrap - T=1002.860 received Welcome(state={'a': 30, 'c': 30, 'b': 20, 'e': 20, 'd': 30, 'g': 30, 'f': 30}, slot=27, decisions={1: Proposal(caller='N6', client_id=100000, input=('get', 'd')), 2: Proposal(caller='N6', client_id=100003, input=('get', 'g')), 3: Proposal(caller='N6', client_id=100005, input=('get', 'f')), 4: Proposal(caller='N6', client_id=100004, input=('get', 'e')), 5: Proposal(caller='N6', client_id=100002, input=('get', 'b')), 6: Proposal(caller='N6', client_id=100006, input=('get', 'a')), 7: Proposal(caller='N6', client_id=100001, input=('get', 'c')), 8: Proposal(caller='N6', client_id=100007, input=('set', 'd', 10)), 9: Proposal(caller='N6', client_id=100008, input=('set', 'g', 10)), 10: Proposal(caller='N6', client_id=100009, input=('set', 'f', 10)), 11: Proposal(caller='N6', client_id=100012, input=('set', 'a', 10)), 12: Proposal(caller='N6', client_id=100013, input=('set', 'c', 10)), 13: Proposal(caller='N6', client_id=100010, input=('set', 'e', 10)), 14: Proposal(caller='N6', client_id=100011, input=('set', 'b', 10)), 15: Proposal(caller='N6', client_id=100014, input=('get', 'd')), 16: Proposal(caller='N6', client_id=100015, input=('get', 'g')), 17: Proposal(caller='N6', client_id=100016, input=('get', 'f')), 18: Proposal(caller='N6', client_id=100017, input=('get', 'a')), 19: Proposal(caller='N6', client_id=100018, input=('get', 'c')), 20: Proposal(caller='N6', client_id=100019, input=('get', 'e')), 21: Proposal(caller='N6', client_id=100020, input=('get', 'b')), 22: Proposal(caller='N6', client_id=100021, input=('set', 'd', 20)), 23: Proposal(caller='N6', client_id=100022, input=('set', 'g', 20)), 24: Proposal(caller='N6', client_id=100023, input=('set', 'f', 20)), 25: Proposal(caller='N6', client_id=100024, input=('set', 'a', 20)), 26: Proposal(caller='N6', client_id=100025, input=('set', 'c', 20)), 27: Proposal(caller='N6', client_id=100026, input=('set', 'e', 20)), 28: Proposal(caller='N6', client_id=100027, input=('set', 'b', 20)), 29: Proposal(caller='N6', client_id=100028, input=('set', 'd', 30))}) from N2

The issue is that I've de-aliased things on receipt rather than on transmission. This particular bug could be fixed by copying self.decisions in do_Join (that is, on transmission), but quite likely there are other examples of this issue.
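The difference between copying on receipt and copying on transmission can be sketched as follows. This is only an illustration under stated assumptions: the queue, send, deliver, and run names are hypothetical, and the in-process "network" is a simple stand-in, not the simulated network or the do_Join code in cluster.py.

```python
import copy
from collections import deque

# Hypothetical in-process "network": messages are queued and delivered later.
queue = deque()


def send(message, copy_on_send):
    # Copying here snapshots the message at transmission time.
    queue.append(copy.deepcopy(message) if copy_on_send else message)


def deliver(copy_on_receipt):
    message = queue.popleft()
    # Copying here may be too late: the sender can already have mutated
    # the dictionary that the queued message still refers to.
    return copy.deepcopy(message) if copy_on_receipt else message


def run(copy_on_send, copy_on_receipt):
    decisions = {26: "set c 20", 28: "set b 20"}  # sender's live state
    send({"slot": 27, "decisions": decisions}, copy_on_send)
    decisions[27] = "set e 20"  # sender commits slot 27 before delivery
    return deliver(copy_on_receipt)


late = run(copy_on_send=False, copy_on_receipt=True)
early = run(copy_on_send=True, copy_on_receipt=False)
```

With the copy made only on receipt, the delivered message already contains slot 27 while still announcing slot=27; with the copy made on transmission, the receiver sees the state as it was when the message was sent.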

Should I make a PR to try to address this more generically?