anvil-verifier/anvil

Assume reconcile is infinitely frequently invoked

marshtompsxd opened this issue · 2 comments

We have discussed the complexity of kube-rs framework's list-then-watch and k8s' MVCC in the meeting and in #72 . Specifying these takes huge efforts but gives us little benefit since we are not verifying k8s or kube-rs implementation anyway. More importantly, all we need from these specifications is to prove reconcile is invoked frequently enough to reach the desired state. Thus, instead of modeling all these internal bits such resource version and cyclic buffer, we should just directly assume the property we want.

Informally, we want the controller's reconcile function to be invoked infinitely frequently. Note that (1) the reconcile function can be invoked with different CR objects so we need weak fairness among all of them and (2) we should only assume the reconcile function gets invoked infinitely frequently for the CR objects that still exist (in etcd).

There are probably multiple ways to write this assumption and a potential solution is to bake it into our top-level state machine. That is, we will have a new action in the top-level state machine that takes a CR object as input. The precondition of this action checks whether the CR object exists in the k8s API (etcd), and its transition function inserts this CR object into the controller state machine's scheduled_reconcile set. The controller state machine has its own internal action that picks a CR object from scheduled_reconcile and invokes reconcile with it. We will have a weak fairness assumption on this new action and use it throughout the liveness proof.

In this way, our proof will also be significantly simplified. A large percentage of our current liveness proof is to reason about when the reconcile is invoked.

One might not like the new action which peeks into both the k8s API state machine and the controller state machine and it is less realistic than the other actions we wrote. But we are assuming the behavior of k8s and kube-rs anyway (by carefully reading source code and documentation).

One of the concerns is whether "infinitely frequently" is too strong and not realistic. I think we should "ensure" it in the implementation by re-queuing the reconcile when it returns. For performance reasons, we can set a short timer for rerunning the reconcile if encountering some transient fault or waiting for external events (e.g., Cassandra decommission), and set a larger timer if the reconcile successfully finishes all its operations.

Please let me know if you think we should use this assumption and if you have a better way to specify it.

@marshtompsxd your plan sounds good to me. I don't think it's an issue that there's an action that peeks into two state machines, it's pretty much exactly what happens inside things like controller-runtime.

closed by #79