Early experiments of applying Acto
marshtompsxd opened this issue · 2 comments
@tylergu is working on applying Acto to test our controllers. Tyler starts by applying Acto to the zookeeper controller (which is not verified yet) and finds some interesting bug caused by a corner case input: if we first downscale the zookeeper cluster to 0 replicas then scale up, scale up will never succeed. The reason is that before updating the statefulset for scale up, the controller needs to first set the expected cluster size by communicating with the zookeeper cluster, and since there isn't any zookeeper server running (replica=0) to handle the request, the controller will get stuck in an error loop forever.
This bug does not appear in the reference zookeeper controller because they set the minimum value of replica number as 1 so the controller will never scale the zookeeper cluster down to 0. A similar fix will work for our controller.
An important alarm this example sends is that we should carefully design our modeling of the environment, especially the external API specification. To be able to rule out such bugs during verification, we should model the behavior that any request sent to the zookeeper API will not succeed if there is not any zookeeper server (pod) running (replicas=0).
Wahahahahahaha
Some details about the bug:
Triggering
The bug happens when Acto first changes the spec.replicas
from 3
to 0
, and then from 0
to 3
.
What happened
The ZooKeeper controller communicates with the ZooKeeper system to update some metadata information when the spec.replicas
is changed (it does not do so when the CR is initially created). When spec.replicas
is changed to 0
, all replicas are deleted so there is no longer a ZooKeeper ensemble. And when spec.replicas
is changed back to 3, the ZooKeeper controller tries to communicate with the ZooKeeper system but fails. This causes the controller to never update the ZooKeeper statefulset with the desired replicas number.
Acto found this bug with differential oracle: if it goes directly from empty state to 3-replica, there are 3 pods. But if it goes from 0-replica to 3-replica, there is 0 pod.
If it has the mapping from the spec.replicas
to statefulset's spec.replicas
, the consistency oracle can also find this bug.