onyx-platform/onyx

BookKeeper state log / key filter interaction issue

lbradstreet opened this issue · 3 comments

It's possible for the following issue to occur:

1. Write an update to the state log.
2. Update the filter immediately after.
3. The async update to the state log fails, so the message is not acked in the callback.
4. The message is replayed, but it has already been seen and is thus filtered out.
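To make the sequence above concrete, here is a minimal Python simulation. It is only a sketch: `FakeStateLog`, `process`, and the message id `"m1"` are hypothetical stand-ins, not Onyx or BookKeeper APIs, and the filter is modelled as a plain set.

```python
class FakeStateLog:
    """Hypothetical stand-in for an async state log (not the BookKeeper API)."""

    def __init__(self, fail_first_n=0):
        self.entries = []            # writes that were durably stored
        self.pending = []            # writes issued but not yet completed
        self.failures_left = fail_first_n

    def async_write(self, entry, callback):
        self.pending.append((entry, callback))

    def flush(self):
        """Complete all pending writes, failing the first `fail_first_n`."""
        pending, self.pending = self.pending, []
        for entry, callback in pending:
            if self.failures_left > 0:
                self.failures_left -= 1
                callback(False)      # write failed
            else:
                self.entries.append(entry)
                callback(True)       # write succeeded


def process(message_id, seen, log, acked):
    """Write to the log and update the filter immediately (the problematic order)."""
    if message_id in seen:
        return "filtered"
    seen.add(message_id)             # filter updated before the write completes

    def on_complete(ok):
        if ok:
            acked.add(message_id)    # only ack on a successful write

    log.async_write(message_id, on_complete)
    return "written"


def run_lost_message_scenario():
    seen, acked = set(), set()
    log = FakeStateLog(fail_first_n=1)
    first = process("m1", seen, log, acked)
    log.flush()                      # async write fails, message is not acked
    replay = process("m1", seen, log, acked)
    log.flush()
    return first, replay, log.entries, acked
```

In this model the replay is filtered even though nothing was stored or acked, so the message is effectively lost.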

You might think we could instead add the key to the filter only when the message is acked (i.e. in the callback). However, any duplicates that arrive between the write and the ack would then pass through the filter.

One option is to delete the key from the filter in the callback when the write fails. The failure return code means the message is not acked, so it will be replayed in due course and will pass the filter again.

Any duplicates that arrived before the failure were ignored, but the message we didn't ack is still outstanding and will be replayed.
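A rough sketch of that option in Python, under the same assumptions as before: `FakeStateLog` and `process_with_rollback` are hypothetical names, the filter is a plain set, and none of this is the actual Onyx implementation.

```python
class FakeStateLog:
    """Hypothetical stand-in for an async state log (not the BookKeeper API)."""

    def __init__(self, fail_first_n=0):
        self.entries = []
        self.pending = []
        self.failures_left = fail_first_n

    def async_write(self, entry, callback):
        self.pending.append((entry, callback))

    def flush(self):
        pending, self.pending = self.pending, []
        for entry, callback in pending:
            if self.failures_left > 0:
                self.failures_left -= 1
                callback(False)
            else:
                self.entries.append(entry)
                callback(True)


def process_with_rollback(message_id, seen, log, acked):
    """Update the filter eagerly, but remove the key again if the write fails."""
    if message_id in seen:
        return "filtered"
    seen.add(message_id)

    def on_complete(ok):
        if ok:
            acked.add(message_id)
        else:
            # Write failed: forget the key so the replayed message is accepted.
            seen.discard(message_id)

    log.async_write(message_id, on_complete)
    return "written"


def run_rollback_scenario():
    seen, acked = set(), set()
    log = FakeStateLog(fail_first_n=1)
    results = [process_with_rollback("m1", seen, log, acked)]
    # A duplicate arriving while the write is in flight is still filtered.
    results.append(process_with_rollback("m1", seen, log, acked))
    log.flush()                      # write fails, key is removed from the filter
    # The unacked message is replayed and now passes the filter.
    results.append(process_with_rollback("m1", seen, log, acked))
    log.flush()                      # second write succeeds and is acked
    return results, log.entries, acked
```

Here the in-flight duplicate is dropped, the failed write removes the key, and the replay is written and acked exactly once.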

I have managed to produce this scenario with Jepsen, so it's definitely a concern as suspected.

Pruned onyx-aggregation-test/20160204T053208.000Z/n1_logs/onyx.log:
16-Feb-04 06:03:31 n1 INFO [onyx.peer.task-lifecycle] - Filter 36153
16-Feb-04 06:03:50 n1 INFO [onyx.peer.task-lifecycle] - Filter 36153 true
16-Feb-04 06:03:53 n1 WARN [onyx.state.log.bookkeeper] - Unable to complete async write to bookkeeper. BookKeeper exception code: -9
16-Feb-04 06:03:53 n1 WARN [onyx.state.log.bookkeeper] - Unable to complete async write to bookkeeper. BookKeeper exception code: -9
16-Feb-04 06:03:53 n1 WARN [onyx.state.log.bookkeeper] - Unable to complete async write to bookkeeper. BookKeeper exception code: -9
16-Feb-04 06:03:53 n1 WARN [onyx.state.log.bookkeeper] - Unable to complete async write to bookkeeper. BookKeeper exception code: -9
16-Feb-04 06:03:53 n1 WARN [onyx.state.log.bookkeeper] - Unable to complete async write to bookkeeper. BookKeeper exception code: -9

I haven't had a Jepsen failure here yet.