Orange-OpenSource/casskop

[BUG] Casskop refuses to set nodesPerRacks to 0 even after cleaning up cassandra keyspaces

srteam2020 opened this issue · 5 comments

Bug Report

We find that casskop and Cassandra sometimes disagree about the keyspaces, which prevents us from setting nodesPerRacks to 0.

  1. Create a CassandraCluster with 3 DCs, each DC with 1 rack and nodesPerRacks set to 1 (see the manifest sketch below).
  2. Clean up the Cassandra pods by running kubectl casskop cleanup --pod cassandra-cluster-dc1-rack1-0. The operator uses Jolokia to communicate with Cassandra and issues the cleanup operation for each keyspace. The operator's log shows:
[cassandra-cluster-dc1-rack1-0.cassandra-cluster]: Cleanup of keyspace system_distributed
[cassandra-cluster-dc1-rack1-0.cassandra-cluster]: Cleanup of keyspace system_auth
[cassandra-cluster-dc1-rack1-0.cassandra-cluster]: Cleanup of keyspace system_traces
  3. Set nodesPerRacks of the first DC to 0. The operator still rejects the operation because it detects existing keyspaces:
The Operator has refused the ScaleDown. Keyspaces still having data [system_distributed system_auth system_traces]

In step 2 it seems that the keyspaces have already been cleaned up and the node is already cleaned, but somehow in step 3 we still cannot set nodesPerRacks to 0, and thus cannot delete a DC.
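For reference, here is a minimal sketch of the kind of CassandraCluster manifest used in step 1. The apiVersion and the placement of the per-DC nodesPerRacks field are assumptions based on the casskop sample manifests, not the exact manifest we used:

```yaml
# Hypothetical sketch of the step-1 cluster: 3 DCs, 1 rack each, 1 node per rack.
# apiVersion and field names are assumed from casskop sample manifests.
apiVersion: db.orange.com/v1alpha1
kind: CassandraCluster
metadata:
  name: cassandra-cluster
spec:
  nodesPerRacks: 1              # cluster-wide default
  topology:
    dc:
      - name: dc1
        nodesPerRacks: 1        # set to 0 in step 3 to request the scale-down
        rack:
          - name: rack1
      - name: dc2
        rack:
          - name: rack1
      - name: dc3
        rack:
          - name: rack1
```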

What did you do?
Tried to set nodesPerRacks to 0 before deleting a DC.

What did you expect to see?
nodesPerRacks is set to 0.

What did you see instead? Under which circumstances?
The operator still rejects the operation because it detects existing keyspaces.

Environment

  • casskop version:
    f87c8e0 (master branch)

  • Kubernetes version information:
    1.18.9

  • Cassandra version:
    3.11

@srteam2020 if you can put together a test like https://github.com/Orange-OpenSource/casskop/tree/master/test/kuttl/multi-dcs that reproduces it, that would help. That test scales a DC down and sets it to zero without a problem, so there might be a situation where it fails for some unknown reason.

Hello @cscetbon
Thanks for the reply. Yes, let me check whether we can use this test to debug.

Any news?

@cscetbon
We have not solved the issue yet, but thanks to your pointer I found that in your test workload, which has two DCs (dc1 and dc2), scaling down dc1 is not allowed while scaling down dc2 is OK. Does that mean that when there are multiple DCs, we have to start scaling down from the last DC? And when nodesPerRacks of the last DC is set to 0, will the StatefulSet for that DC automatically be deleted?

If that is the case, it probably solves our problem. We will go back and try this again.
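If scaling down does have to start from the last DC, the change we would try is something like the sketch below, using the two-DC topology from the kuttl test (same assumed field names as in the manifest sketch above):

```yaml
# Hypothetical sketch: scale the last DC (dc2) down by setting its nodesPerRacks to 0.
# Whether the dc2 StatefulSet is then deleted automatically is the question above.
spec:
  topology:
    dc:
      - name: dc1
        nodesPerRacks: 1
        rack:
          - name: rack1
      - name: dc2
        nodesPerRacks: 0        # request the scale-down of the last DC
        rack:
          - name: rack1
```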

Sorry for the late response, I missed this comment. Nah, the reason dc1 can't be scaled down is that there is data hosted on that datacenter. During the test I change the replication strategy from SimpleStrategy to NetworkTopologyStrategy, excluding dc2, so that it can be scaled down. See https://github.com/Orange-OpenSource/casskop/blob/master/test/kuttl/multi-dcs/03-disableReplToDC2.yaml#L8
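To illustrate, a sketch of what that step can look like as a kuttl TestStep is below; the pod name, keyspace list, and replication factors here are illustrative, not the exact contents of the linked file:

```yaml
# Hypothetical kuttl TestStep: switch keyspaces to NetworkTopologyStrategy with
# replicas only in dc1, so dc2 no longer owns data and can be scaled down.
apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - script: |
      kubectl exec cassandra-cluster-dc1-rack1-0 -- cqlsh -e "
        ALTER KEYSPACE system_auth        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1};
        ALTER KEYSPACE system_distributed WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1};
        ALTER KEYSPACE system_traces      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1};"
```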

Do you know where it's coming from, or should we close the ticket for now?