Challenges in modelling the controller's list-then-watch behavior
marshtompsxd opened this issue · 2 comments
Every controller has multiple watchers, each watching a different type of resource. A watcher triggers the controller's reconcile function when certain state updates happen.
A watcher is implemented as a state machine:
(1) It starts in the `Empty` state and issues a list request to the Kubernetes API. If the list returns successfully with a resource version number, the watcher advances to `InitListed`; otherwise it stays in `Empty` and re-issues the list. The resource version is essentially a global counter that increments whenever any state object in the cluster gets updated (each update results in a new version) and is used for concurrency control. More information can be found here: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency. The resource version returned by the list represents the version of the cluster state at the time the list query happens.
(2) In the `InitListed` state, the watcher issues a watch request with the resource version returned by the list and advances to the `Watching` state. The watch request asks the apiserver to stream all the state updates starting from that resource version, if they still exist in the apiserver's watch cache. If the apiserver no longer preserves such "old" state updates, the watch request returns HTTP 410 and the watcher goes back to `Empty`; otherwise, it stays in `Watching` and keeps receiving state updates.
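A minimal sketch of this watcher state machine in plain Rust; the type and variant names are illustrative and not taken from client-go or any real client library:

```rust
// Illustrative sketch of the watcher state machine described above;
// all names are hypothetical.

#[derive(Debug, PartialEq)]
enum WatcherState {
    Empty,
    InitListed { rv: u64 },
    Watching { rv: u64 },
}

#[allow(dead_code)]
enum Event {
    ListOk { rv: u64 },      // list returned successfully with a resource version
    ListErr,                 // list failed; stay in Empty and re-issue it
    WatchUpdate { rv: u64 }, // watch streamed an update at a newer resource version
    WatchGone,               // HTTP 410: the rv is no longer in the apiserver's watch cache
}

fn step(state: WatcherState, event: Event) -> WatcherState {
    match (state, event) {
        // Empty -> InitListed on a successful list; otherwise stay in Empty.
        (WatcherState::Empty, Event::ListOk { rv }) => WatcherState::InitListed { rv },
        (WatcherState::Empty, Event::ListErr) => WatcherState::Empty,
        // From InitListed the watcher issues the watch and starts streaming;
        // in Watching it keeps consuming updates.
        (WatcherState::InitListed { .. }, Event::WatchUpdate { rv })
        | (WatcherState::Watching { .. }, Event::WatchUpdate { rv }) => {
            WatcherState::Watching { rv }
        }
        // HTTP 410 sends the watcher back to Empty to re-list from scratch.
        (_, Event::WatchGone) => WatcherState::Empty,
        (other, _) => other,
    }
}

fn main() {
    let s = step(WatcherState::Empty, Event::ListOk { rv: 42 });
    let s = step(s, Event::WatchUpdate { rv: 43 });
    assert_eq!(s, WatcherState::Watching { rv: 43 });
}
```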
All the state objects read by the list request and all the state updates streamed by the watch request are converted into reconciliation requests, which are sent to an internal scheduler to trigger the reconcile function.
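As a rough illustration, that conversion could look like the following sketch; `ReconcileRequest`, `Scheduler`, and `on_object_seen` are hypothetical names, and a real scheduler would also deduplicate and rate-limit:

```rust
use std::collections::VecDeque;

// Hypothetical sketch: every object read by the list and every update streamed
// by the watch becomes a reconciliation request keyed by the object's
// namespace and name, queued on an internal scheduler.
#[derive(Debug, Clone, PartialEq)]
struct ReconcileRequest {
    namespace: String,
    name: String,
}

struct Scheduler {
    queue: VecDeque<ReconcileRequest>,
}

impl Scheduler {
    fn enqueue(&mut self, req: ReconcileRequest) {
        // A real scheduler would also deduplicate and rate-limit; this just queues.
        self.queue.push_back(req);
    }
}

// Called both for objects returned by the initial list and for watch updates.
fn on_object_seen(scheduler: &mut Scheduler, namespace: &str, name: &str) {
    scheduler.enqueue(ReconcileRequest {
        namespace: namespace.to_string(),
        name: name.to_string(),
    });
}

fn main() {
    let mut scheduler = Scheduler { queue: VecDeque::new() };
    on_object_seen(&mut scheduler, "default", "my-app"); // from the list
    on_object_seen(&mut scheduler, "default", "my-app"); // from a watch update
    assert_eq!(scheduler.queue.len(), 2);
}
```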
To precisely specify the list-then-watch behavior, we need to:
(1) Model the resource version
(2) Model the apiserver's watch cache (which is a cyclic buffer)
(3) Specify the list/watch requests using recursion (since loops are not supported in spec code) and prove their postconditions (see the sketch below)
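For (3), a sketch of the recursive shape is below, written as plain Rust for readability; in Verus spec code this would be a `spec fn` with a `decreases` clause, and all names here are illustrative:

```rust
// Hypothetical sketch of specifying "turn every listed object into a
// reconciliation request" with recursion instead of a loop.

#[derive(Debug, Clone, PartialEq)]
struct ObjectRef {
    namespace: String,
    name: String,
}

// Recursively convert the remaining listed objects into reconciliation requests.
fn requests_from_list(objects: &[ObjectRef]) -> Vec<ObjectRef> {
    match objects.split_first() {
        None => Vec::new(), // base case: nothing left to convert
        Some((head, tail)) => {
            // recursive case: convert the head, then recurse on the tail
            let mut rest = requests_from_list(tail);
            rest.insert(0, head.clone());
            rest
        }
    }
}

fn main() {
    let listed = vec![
        ObjectRef { namespace: "default".into(), name: "a".into() },
        ObjectRef { namespace: "default".into(), name: "b".into() },
    ];
    let requests = requests_from_list(&listed);
    // A postcondition one might prove: every listed object has exactly one
    // corresponding request, in the same order.
    assert_eq!(requests, listed);
}
```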
(1) and (2) are tricky because the resource version is updated by every single update in the entire k8s cluster, but we will not be able to model every single event.
On the other hand, the controller's level-triggering pattern makes it less sensitive to the result of list-then-watch: the reconcile function is unaware of which event triggered it; it only cares about how to reach the desired state from the current state. That is, even if the controller's watcher misses some updates, as long as there are enough events to trigger the reconcile function to reach the desired state, the controller should be fine. Further, if we assume the controller always requeues a reconcile after each round of reconcile, the controller only needs one event to trigger the first reconcile.
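A toy sketch of that argument, with a level-triggered `reconcile` that only compares current and desired state and requeues itself until it is done (all names hypothetical):

```rust
// Hypothetical sketch of the "requeue after each reconcile" argument: once a
// single event triggers the first reconcile, the controller keeps driving the
// state toward the desired state without needing any further watch events.

#[derive(Debug, PartialEq)]
enum ReconcileOutcome {
    Done,    // desired state reached; no requeue needed
    Requeue, // not done yet; schedule another round
}

// A toy level-triggered reconcile: it only compares the current state against
// the desired state and never looks at which event triggered it.
fn reconcile(current: &mut u32, desired: u32) -> ReconcileOutcome {
    if *current == desired {
        ReconcileOutcome::Done
    } else {
        *current += 1; // make one step of progress toward the desired state
        ReconcileOutcome::Requeue
    }
}

fn main() {
    let mut current = 0;
    let desired = 3;
    // One triggering event starts the first reconcile; requeues do the rest.
    loop {
        match reconcile(&mut current, desired) {
            ReconcileOutcome::Done => break,
            ReconcileOutcome::Requeue => continue,
        }
    }
    assert_eq!(current, desired);
}
```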
"(1) and (2) are tricky because resource version is updated by every single update in the entire k8s cluster but we will not be able to model every single event."
You might not need to model every single event. You can treat the rv as an opaque value for now. There are certain interactions where it matters whether the controller is modifying the correct version of a resource, which is where you want the rv check to happen. Even there, the rv itself can be opaque; you just want the controller to handle the possibility that the rv could have changed in between (which would be the typical success/error code branches that we are planning for).
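A minimal sketch of that suggestion, assuming a hypothetical `OpaqueRv` type that is never interpreted, only compared, and an update that either succeeds or fails with a conflict:

```rust
// Hypothetical sketch: the resource version stays opaque, and the controller
// only branches on whether its update was accepted or rejected because the
// rv it read is now stale.

#[derive(Debug, Clone, PartialEq)]
struct OpaqueRv(String); // contents are never interpreted, only compared

enum UpdateResult {
    Ok,
    Conflict, // the object's rv changed since it was read; re-read and retry
}

// A toy apiserver-side check: accept the update only if the rv still matches.
fn apiserver_update(current_rv: &OpaqueRv, rv_seen_by_controller: &OpaqueRv) -> UpdateResult {
    if current_rv == rv_seen_by_controller {
        UpdateResult::Ok
    } else {
        UpdateResult::Conflict
    }
}

fn main() {
    let rv_read = OpaqueRv("abc".into());
    let rv_now = OpaqueRv("def".into()); // someone else updated the object meanwhile

    // The controller handles both branches without ever looking inside the rv.
    match apiserver_update(&rv_now, &rv_read) {
        UpdateResult::Ok => println!("update accepted"),
        UpdateResult::Conflict => println!("stale rv: re-read the object and retry"),
    }
}
```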
Problem solved by the `schedule_controller_reconcile` action.
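A minimal sketch of what such an action could look like, assuming a simple state-machine model where a reconcile may be scheduled for any object at any time; only the name `schedule_controller_reconcile` comes from the thread, everything else is illustrative:

```rust
// Hypothetical sketch of how a schedule_controller_reconcile action could
// abstract away list-then-watch: instead of modeling list, watch, and resource
// versions, the model allows a reconcile to be scheduled for any object at any
// time, which over-approximates the real triggering behavior.

use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
struct ObjectRef {
    namespace: String,
    name: String,
}

#[derive(Debug, Default)]
struct ControllerState {
    scheduled: VecDeque<ObjectRef>, // reconciles waiting to run
}

// The modeled action: any object may be picked up for reconciliation,
// regardless of which (if any) watch event would have triggered it.
fn schedule_controller_reconcile(state: &mut ControllerState, obj: ObjectRef) {
    state.scheduled.push_back(obj);
}

fn main() {
    let mut state = ControllerState::default();
    schedule_controller_reconcile(
        &mut state,
        ObjectRef { namespace: "default".into(), name: "my-app".into() },
    );
    assert_eq!(state.scheduled.len(), 1);
}
```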