[BUG] Replica of cassandraCluster gets inconsistent under different reconciliation interleaving
srteam2020 opened this issue · 3 comments
Bug Report
The casskop operator has the following logic to reject a user request that scales the cluster down to 0; the replica count is then restored from the previous cluster state stored by the operator in an annotation:
```go
//Global scaleDown to 0 is forbidden
if cc.Spec.NodesPerRacks == 0 {
	logrus.WithFields(logrus.Fields{"cluster": cc.Name}).
		Warningf("The Operator has refused the change on NodesPerRack=0 restore to OldValue[%d]",
			oldCRD.Spec.NodesPerRacks)
	cc.Spec.NodesPerRacks = oldCRD.Spec.NodesPerRacks
	needUpdate = true
}
```
Here oldCRD refers to the last-applied-configuration of the CRD, which is dumped and stored by the operator in an annotation during every reconcile.
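To make the stale-read concrete, here is a minimal, self-contained sketch (simplified types and a hypothetical annotation key, not the actual casskop code) of the general store-and-read-back pattern that produces the oldCRD value:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the relevant CRD fields (hypothetical, for illustration only).
type CassandraClusterSpec struct {
	NodesPerRacks int32 `json:"nodesPerRacks"`
}

type CassandraCluster struct {
	Annotations map[string]string
	Spec        CassandraClusterSpec
}

// Illustrative annotation key; the real operator uses its own key.
const lastAppliedAnnotation = "last-applied-configuration"

// storeLastApplied dumps the current spec into an annotation, mimicking what
// the operator does during a reconcile.
func storeLastApplied(cc *CassandraCluster) error {
	raw, err := json.Marshal(cc.Spec)
	if err != nil {
		return err
	}
	if cc.Annotations == nil {
		cc.Annotations = map[string]string{}
	}
	cc.Annotations[lastAppliedAnnotation] = string(raw)
	return nil
}

// loadLastApplied reads the previously stored spec back; this is the "oldCRD"
// view used by the NodesPerRacks == 0 guard.
func loadLastApplied(cc *CassandraCluster) (CassandraClusterSpec, error) {
	var old CassandraClusterSpec
	err := json.Unmarshal([]byte(cc.Annotations[lastAppliedAnnotation]), &old)
	return old, err
}

func main() {
	cc := &CassandraCluster{Spec: CassandraClusterSpec{NodesPerRacks: 2}}
	_ = storeLastApplied(cc) // reconcile #1 records NodesPerRacks=2

	// If the user edits 2 -> 1 -> 0 before the next reconcile runs,
	// the annotation still says 2, so refusing 0 restores 2, not 1.
	cc.Spec.NodesPerRacks = 0
	old, _ := loadLastApplied(cc)
	fmt.Println("oldCRD.NodesPerRacks =", old.NodesPerRacks) // prints 2
}
```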
We ran a workload that changes the replica count (NodesPerRacks) from 2 -> 1 -> 0. Ideally, the replica count should remain 1, since scaling down to 0 is not allowed. However, we observed 2 replicas at the end.
The reason is that the goroutine performing the reconciliation (g1) and the goroutine updating the cassandraCluster object (g2) run concurrently, and certain interleavings of the two can lead to the unexpected scaling behavior.
Ideally, the following interleaving will lead to the correct result:
- g2: set NodesPerRacks (from 2) to 1
- g1: read NodesPerRacks and scale the cluster from 2 to 1
- g2: set NodesPerRacks (from 1) to 0
- g1: read the oldCRD state (NodesPerRacks: 1), and set NodesPerRacks (and scale the cluster) to 1
We found that the following interleaving leads to the unexpected scaling behavior mentioned above (a minimal sketch simulating both interleavings follows the list):
- g2: set NodesPerRacks (from 2) to 1
- g2: set NodesPerRacks (from 1) to 0
- g1: read the oldCRD state (NodesPerRacks: 2), and set NodesPerRacks (and scale the cluster) to 2, since scaling to 0 is not allowed
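To illustrate the difference between the two interleavings, here is a toy, sequential simulation (hypothetical names, not casskop or controller-runtime code) of the same guard logic; depending on whether a reconcile runs between the two user edits, the cluster ends up at 1 or at 2:

```go
package main

import "fmt"

// Toy model of the race (illustrative only).
// desired is what the user (g2) writes; lastApplied is the "oldCRD" value the
// operator (g1) recorded at the end of its previous reconcile.
type state struct {
	desired     int32 // spec.NodesPerRacks as edited by the user
	lastApplied int32 // last-applied-configuration view
	cluster     int32 // actual cluster size
}

// reconcile mimics the NodesPerRacks == 0 guard from the snippet above.
func reconcile(s *state) {
	if s.desired == 0 {
		// Scale-down to 0 is refused: restore the previously applied value.
		s.desired = s.lastApplied
	}
	s.cluster = s.desired
	s.lastApplied = s.desired // dump the new last-applied-configuration
}

func main() {
	// Correct interleaving: g1 reconciles between the two edits.
	s := &state{desired: 2, lastApplied: 2, cluster: 2}
	s.desired = 1 // g2: 2 -> 1
	reconcile(s)  // g1: scales to 1, records lastApplied=1
	s.desired = 0 // g2: 1 -> 0
	reconcile(s)  // g1: refuses 0, restores 1
	fmt.Println("reconcile between edits:", s.cluster) // 1

	// Buggy interleaving: both edits land before g1 runs.
	s = &state{desired: 2, lastApplied: 2, cluster: 2}
	s.desired = 1 // g2: 2 -> 1
	s.desired = 0 // g2: 1 -> 0
	reconcile(s)  // g1: refuses 0, but lastApplied is still 2 -> scales to 2
	fmt.Println("edits batched before reconcile:", s.cluster) // 2
}
```

Running this sketch prints 1 for the first interleaving and 2 for the second, matching the behavior described above.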
What did you do?
As mentioned above, we changed the replica (NodesPerRacks) from 2 -> 1 -> 0.
What did you expect to see?
There should be 1 replica at the end.
What did you see instead? Under which circumstances?
We observed that it ended up with 2 replicas.
Environment
- casskop version: f87c8e0 (master branch)
- Kubernetes version information: 1.18.9
- Cassandra version: 3.11
Possible Solution
This issue is probably hard to fix from the operator side, as how the operator receives events from users (e.g., scaling up/down) and how the different goroutines interleave with each other are decided by the controller-runtime package, not by the operator code. We have not found a good way to solve this issue so far. We have also observed similar behavior in other controllers. We are still working on a good solution to this kind of issue and will send a PR once we have one.
This issue might be hard to reproduce, as it only manifests when the particular interleaving happens (e.g., when the reconcile goroutine runs slowly). We have recently been building a tool, https://github.com/sieve-project/sieve, to reliably detect and reproduce bugs like the one above in various k8s controllers. This bug can be reproduced automatically with just a few commands using our tool. Please give it a try if you also want to reproduce the bug.
To use the tool, please first run the environment check:
python3 check-env.py
If the checker passes, please build the Kubernetes and controller images using the commands below as our tool needs to set up a kind cluster to reproduce the bug:
python3 build.py -p kubernetes -m obs-gap -d DOCKER_REPO_NAME
python3 build.py -p casskop-operator -m obs-gap -d DOCKER_REPO_NAME -s f87c8e05c1a2896732fc5f3a174f1eb99e936907
Here `-d DOCKER_REPO_NAME` should be a Docker repo that you have write access to, and `-s f87c8e05c1a2896732fc5f3a174f1eb99e936907` is the commit where the bug was found.
Finally, run
python3 sieve.py -p casskop-operator -t nodesperrack -d DOCKER_REPO_NAME
It will try to scale the Cassandra cluster from 2 -> 1 -> 0 and also manipulate the goroutine interleaving of the operator to trigger the corner-case interleaving mentioned above.
The bug is successfully reproduced if you see the message below:
[ERROR] pod SIZE inconsistency: 2 seen after learning run, but 3 seen after testing run
Ideally there should be 2 pods eventually (1 operator pod and 1 Cassandra replica), but after the test run we observe 3 pods (1 operator pod and 2 Cassandra replicas, as mentioned above).
For more information please refer to https://github.com/sieve-project/sieve#sieve-testing-datacenter-infrastructures-using-partial-histories and https://github.com/sieve-project/sieve/blob/main/docs/reprod.md#orange-opensource-casskop-342
Please let me know if you encounter any problems when reproducing the bug with the tool.
@srteam2020 is it something you'd like to work on?
@cscetbon Thank you for the reply.
We are looking for a clean fix for this problem. I will get back and send a PR if we find a solution.