[Multi Casskop] Rebuild operation doesn't work
I created two GKE clusters on GCP and deployed a CassKop operator and external-dns on each. On top of that, I deployed the multi-casskop operator with the following MultiCasskop resource:
apiVersion: db.orange.com/v1alpha1
kind: MultiCasskop
metadata:
  name: multi-casskop-demo
spec:
  deleteCassandraCluster: true
  base: # <-- Specify the base of our CassandraCluster
    apiVersion: "db.orange.com/v1alpha1"
    kind: "CassandraCluster"
    metadata:
      name: cassandra-demo
      namespace: cassandra-demo
      labels:
        cluster: casskop
    spec:
      cassandraImage: orangeopensource/cassandra-image:3.11
      bootstrapImage: orangeopensource/cassandra-bootstrap:0.1.3
      configMapName: cassandra-configmap-v1
      service:
        annotations:
          external-dns.alpha.kubernetes.io/hostname: casskop.external-dns-test.gcp.trycatchlearn.fr.
      rollingPartition: 0
      dataCapacity: "20Gi"
      dataStorageClass: "standard-wait"
      imagepullpolicy: IfNotPresent
      # imagepullpolicy: Always
      hardAntiAffinity: false
      deletePVC: true
      autoPilot: false
      gcStdout: false
      autoUpdateSeedList: false
      debug: false
      maxPodUnavailable: 1
      nodesPerRacks: 1
      runAsUser: 999
      resources:
        requests: &requests
          cpu: '1'
          memory: 2Gi
        limits: *requests
    status:
      seedlist: # <-- for now the seedlist must be filled manually with the predictable pod names (see the sketch after this manifest)
        - cassandra-demo-dc1-rack1-0.casskop.external-dns-test.gcp.trycatchlearn.fr
        - cassandra-demo-dc1-rack1-1.casskop.external-dns-test.gcp.trycatchlearn.fr
        - cassandra-demo-dc2-rack1-0.casskop.external-dns-test.gcp.trycatchlearn.fr
        - cassandra-demo-dc2-rack1-1.casskop.external-dns-test.gcp.trycatchlearn.fr
  override: # <-- Specify overrides of the CassandraCluster depending on the target Kubernetes cluster
    gke-master-west1-b:
      spec:
        topology:
          dc:
            - name: dc1
              nodesPerRacks: 2
              numTokens: 256
              labels:
                failure-domain.beta.kubernetes.io/region: europe-west1
              rack:
                - name: rack1
                  rollingPartition: 0
                  labels:
                    failure-domain.beta.kubernetes.io/zone: europe-west1-b
    gke-slave-west1-c:
      spec:
        topology:
          dc:
            - name: dc2
              nodesPerRacks: 2
              numTokens: 256
              labels:
                failure-domain.beta.kubernetes.io/region: europe-west1
              rack:
                - name: rack1
                  rollingPartition: 0
                  labels:
                    failure-domain.beta.kubernetes.io/zone: europe-west1-c
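Since the seed names have to be predictable, here is a minimal Go sketch of how the seedlist above can be derived from the `<clusterName>-<dc>-<rack>-<ordinal>.<dns-zone>` convention visible in it (`seedNames` is an illustrative helper, not part of casskop):

```go
package main

import "fmt"

// seedNames builds the externally resolvable seed names following the
// <clusterName>-<dc>-<rack>-<ordinal>.<dnsZone> pattern used in the
// seedlist above. Illustrative helper, not part of casskop.
func seedNames(clusterName, dc, rack, dnsZone string, nodes int) []string {
	names := make([]string, 0, nodes)
	for i := 0; i < nodes; i++ {
		names = append(names,
			fmt.Sprintf("%s-%s-%s-%d.%s", clusterName, dc, rack, i, dnsZone))
	}
	return names
}

func main() {
	zone := "casskop.external-dns-test.gcp.trycatchlearn.fr"
	// Two nodes per DC, matching the nodesPerRacks: 2 overrides above.
	for _, dc := range []string{"dc1", "dc2"} {
		for _, s := range seedNames("cassandra-demo", dc, "rack1", zone, 2) {
			fmt.Println(s)
		}
	}
}
```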
I have one DC on the gke-master-west1-b k8s cluster (dc1) and a second one on the gke-slave-west1-c k8s cluster (dc2).
# gke-master-west1-b
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cassandra-demo-dc1-rack1-0 1/1 Running 0 21h
cassandra-demo-dc1-rack1-1 1/1 Running 0 21h
casskop-cassandra-operator-b7d96f878-rgvd5 1/1 Running 0 23h
external-dns-7f787b9c77-8xpnk 1/1 Running 0 23h
multi-casskop-5bc5fb4588-kbg8p 1/1 Running 0 22h
# gke-slave-west1-c
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cassandra-demo-dc2-rack1-0 1/1 Running 0 21h
cassandra-demo-dc2-rack1-1 1/1 Running 0 21h
casskop-cassandra-dca1e1c411f84af1bb36e25451dffd4d-6bb8867ljp52 1/1 Running 0 21h
external-dns-848d48fb56-zzxmk 1/1 Running 0 23h
Both datacenters see each other once the nodes have joined; from a dc1 pod:

cassandra@cassandra-demo-dc1-rack1-0:/$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.52.3.4 189.92 KiB 256 53.3% 4e3017b7-27c4-410e-8fa2-7888a3eaf599 rack1
UN 10.52.4.4 294.74 KiB 256 51.1% 174df7cd-b29a-415a-8cdf-1bce6fde17b7 rack1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.8.3.4 282.14 KiB 256 48.8% c27c0ff3-c24e-49eb-aa9c-e24c0a7649f5 rack1
UN 10.8.4.5 298.26 KiB 256 46.8% 331565f8-1648-4900-9c11-a2a7d94b33f5 rack1
Once the Cassandra cluster was ready, I tried to perform a rebuild operation on the dc2 pods, streaming from dc1, using the casskop plugin:

$ kubectl casskop rebuild --pod cassandra-demo-dc2-rack1-0 dc1
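For context, if I read the operator correctly, this pod operation boils down to a Jolokia exec call against Cassandra's StorageService MBean on the target pod, i.e. the remote equivalent of `nodetool rebuild dc1`. A hedged Go sketch (`triggerRebuild`, the endpoint URL, and the overload handling are my assumptions, not casskop's actual code):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// triggerRebuild illustrates what the plugin command ultimately asks the
// operator to do: a Jolokia "exec" request invoking StorageService.rebuild
// with the source DC on the target pod. Not casskop's actual code.
func triggerRebuild(jolokiaURL, sourceDC string) error {
	req := map[string]interface{}{
		"type":  "exec",
		"mbean": "org.apache.cassandra.db:type=StorageService",
		// rebuild is overloaded on the 3.11 MBean, so the one-argument
		// variant is named explicitly (assumption on my part).
		"operation": "rebuild(java.lang.String)",
		"arguments": []string{sourceDC},
	}
	payload, err := json.Marshal(req)
	if err != nil {
		return err
	}
	resp, err := http.Post(jolokiaURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("jolokia returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical Jolokia endpoint exposed by the target pod.
	_ = triggerRebuild("http://cassandra-demo-dc2-rack1-0:8778/jolokia", "dc1")
}
```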
The CassKop operator doesn't succeed in performing the rebuild because it is not aware of the second DC. To perform the rebuild operation, the operator checks the topology of the local CassandraCluster resource, which doesn't contain the definition of the remote DC:

https://github.com/Orange-OpenSource/casskop/blob/master/pkg/controller/cassandracluster/pod_operation.go#L688

An acceptable solution seems to be to remove this check and let the Jolokia call handle the error:

https://github.com/Orange-OpenSource/casskop/blob/master/pkg/controller/cassandracluster/pod_operation.go#L700
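To make the proposed change concrete, a minimal sketch of the suggested control flow, with illustrative names (`JolokiaCaller`, `NodeRebuild`, and `rebuildPod` are not the actual casskop identifiers):

```go
package podoperation

import "fmt"

// JolokiaCaller abstracts the Jolokia call that asks Cassandra's
// StorageService to rebuild from a source DC (illustrative interface).
type JolokiaCaller interface {
	NodeRebuild(pod, sourceDC string) error
}

// rebuildPod sketches the suggested flow: drop the local-topology check and
// let the Jolokia call surface any "unknown datacenter" error.
func rebuildPod(jolokia JolokiaCaller, pod, sourceDC string) error {
	// Removed: the early return when sourceDC is absent from the local
	// CassandraCluster topology. In a multi-casskop deployment the source DC
	// lives in another Kubernetes cluster, so the local resource is not
	// authoritative for the whole ring.

	if err := jolokia.NodeRebuild(pod, sourceDC); err != nil {
		// Cassandra itself rejects a genuinely unknown DC, so a bad argument
		// is still reported, just by the right source of truth.
		return fmt.Errorf("rebuild of %s from %s failed: %w", pod, sourceDC, err)
	}
	return nil
}
```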