Orange-OpenSource/casskop

[Multi Casskop] Rebuild operation doesn't work


I created two GKE clusters on GCP and deployed a CassKop operator and external-dns on each.
On top of that, I deployed the Multi-CassKop operator and the following MultiCasskop resource:

apiVersion: db.orange.com/v1alpha1
kind: MultiCasskop
metadata:
  name: multi-casskop-demo
spec:
  deleteCassandraCluster: true
  base: #<-- Specify the base of our CassandraCluster
    apiVersion: "db.orange.com/v1alpha1"
    kind: "CassandraCluster"
    metadata:
      name: cassandra-demo
      namespace: cassandra-demo
      labels:
        cluster: casskop
    spec:
      cassandraImage: orangeopensource/cassandra-image:3.11
      bootstrapImage: orangeopensource/cassandra-bootstrap:0.1.3
      configMapName: cassandra-configmap-v1
      service:
        annotations:
          external-dns.alpha.kubernetes.io/hostname: casskop.external-dns-test.gcp.trycatchlearn.fr.
      rollingPartition: 0
      dataCapacity: "20Gi"
      dataStorageClass: "standard-wait"
      imagepullpolicy: IfNotPresent
#      imagepullpolicy: Always
      hardAntiAffinity: false
      deletePVC: true
      autoPilot: false
      gcStdout: false
      autoUpdateSeedList: false
      debug: false
      maxPodUnavailable: 1
      nodesPerRacks: 1
      runAsUser: 999
      resources:
        requests: &requests
          cpu: '1'
          memory: 2Gi
        limits: *requests
    status:
      seedlist:   #<-- at this time the seedlist must be filled in manually with the known, predictable pod names
        - cassandra-demo-dc1-rack1-0.casskop.external-dns-test.gcp.trycatchlearn.fr
        - cassandra-demo-dc1-rack1-1.casskop.external-dns-test.gcp.trycatchlearn.fr
        - cassandra-demo-dc2-rack1-0.casskop.external-dns-test.gcp.trycatchlearn.fr
        - cassandra-demo-dc2-rack1-1.casskop.external-dns-test.gcp.trycatchlearn.fr    

  override: #<-- Specify overrides of the CassandraCluster depending on the target kubernetes cluster
    gke-master-west1-b:
      spec:
        topology:
          dc:
            - name: dc1
              nodesPerRacks: 2
              numTokens: 256
              labels:
                failure-domain.beta.kubernetes.io/region: europe-west1
              rack:
                - name: rack1
                  rollingPartition: 0
                  labels:
                    failure-domain.beta.kubernetes.io/zone: europe-west1-b
    gke-slave-west1-c:
      spec:
        topology:
          dc:
            - name: dc2
              nodesPerRacks: 2
              numTokens: 256
              labels:
                failure-domain.beta.kubernetes.io/region: europe-west1
              rack:
                - name: rack1
                  rollingPartition: 0
                  labels:
                    failure-domain.beta.kubernetes.io/zone: europe-west1-c

I now have one DC (dc1) on the gke-master-west1-b Kubernetes cluster and a second one (dc2) on the gke-slave-west1-c Kubernetes cluster.

# gke-master-west1-b
$ kubectl get pods
NAME                                         READY   STATUS    RESTARTS   AGE
cassandra-demo-dc1-rack1-0                   1/1     Running   0          21h
cassandra-demo-dc1-rack1-1                   1/1     Running   0          21h
casskop-cassandra-operator-b7d96f878-rgvd5   1/1     Running   0          23h
external-dns-7f787b9c77-8xpnk                1/1     Running   0          23h
multi-casskop-5bc5fb4588-kbg8p               1/1     Running   0          22h

# gke-slave-west1-c
$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
cassandra-demo-dc2-rack1-0                                        1/1     Running   0          21h
cassandra-demo-dc2-rack1-1                                        1/1     Running   0          21h
casskop-cassandra-dca1e1c411f84af1bb36e25451dffd4d-6bb8867ljp52   1/1     Running   0          21h
external-dns-848d48fb56-zzxmk                                     1/1     Running   0          23h

cassandra@cassandra-demo-dc1-rack1-0:/$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.52.3.4  189.92 KiB  256          53.3%             4e3017b7-27c4-410e-8fa2-7888a3eaf599  rack1
UN  10.52.4.4  294.74 KiB  256          51.1%             174df7cd-b29a-415a-8cdf-1bce6fde17b7  rack1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.8.3.4   282.14 KiB  256          48.8%             c27c0ff3-c24e-49eb-aa9c-e24c0a7649f5  rack1
UN  10.8.4.5   298.26 KiB  256          46.8%             331565f8-1648-4900-9c11-a2a7d94b33f5  rack1

Once the Cassandra cluster was ready, I tried to perform a rebuild operation on the dc2 pods, streaming from dc1, using the casskop plugin:

$ kubectl casskop rebuild --pod cassandra-demo-dc2-rack1-0 dc1

The CassKop operator fails to perform the rebuild because it is not aware of the second DC. To perform the rebuild operation, the operator checks the topology of the local CassandraCluster resource, which does not contain the definition of the remote DC:
https://github.com/Orange-OpenSource/casskop/blob/master/pkg/controller/cassandracluster/pod_operation.go#L688
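
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the kind of lookup described above. The types and the knowsDC helper are hypothetical and only mirror the shape of the check, not the actual casskop code:

package main

import "fmt"

// Hypothetical, simplified stand-ins for the CRD types; they only
// illustrate the shape of the lookup, not the real casskop structs.
type DC struct{ Name string }

type Topology struct{ DC []DC }

type CassandraCluster struct{ Topology Topology }

// knowsDC mirrors the pre-rebuild check: the source DC must be present
// in the topology of the local CassandraCluster resource.
func knowsDC(cc CassandraCluster, dcName string) bool {
	for _, dc := range cc.Topology.DC {
		if dc.Name == dcName {
			return true
		}
	}
	return false
}

func main() {
	// On gke-slave-west1-c the local CR only defines dc2 (see the override
	// above), so a rebuild "from dc1" is rejected before Jolokia is called.
	local := CassandraCluster{Topology: Topology{DC: []DC{{Name: "dc2"}}}}
	if !knowsDC(local, "dc1") {
		fmt.Println("rebuild aborted: dc1 not found in local topology")
	}
}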

An acceptable solution seems to be to remove this check and let the Jolokia call handle the error: https://github.com/Orange-OpenSource/casskop/blob/master/pkg/controller/cassandracluster/pod_operation.go#L700
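
As an illustration of the suggestion, here is a hedged sketch of what dropping the guard and relying on the node's own answer could look like. Again the names (JolokiaClient, rebuildPod, fakeJolokia) are hypothetical and not the real casskop API:

package main

import (
	"errors"
	"fmt"
)

// JolokiaClient is a hypothetical stand-in for the operator's Jolokia
// wrapper; only the rebuild call matters for this sketch.
type JolokiaClient interface {
	Rebuild(fromDC string) error // triggers the rebuild on the target node
}

// rebuildPod sketches the suggested behaviour: no local-topology guard,
// the error returned by the node is surfaced as-is.
func rebuildPod(client JolokiaClient, fromDC string) error {
	if err := client.Rebuild(fromDC); err != nil {
		return fmt.Errorf("rebuild from %s failed: %w", fromDC, err)
	}
	return nil
}

// fakeJolokia simulates a node that knows the whole ring (dc1 and dc2),
// unlike the local CassandraCluster resource.
type fakeJolokia struct{ knownDCs map[string]bool }

func (f fakeJolokia) Rebuild(fromDC string) error {
	if !f.knownDCs[fromDC] {
		return errors.New("simulated Cassandra error: unknown source datacenter")
	}
	return nil
}

func main() {
	client := fakeJolokia{knownDCs: map[string]bool{"dc1": true, "dc2": true}}
	fmt.Println(rebuildPod(client, "dc1")) // <nil>: succeeds even though dc1 is absent from the local CR
	fmt.Println(rebuildPod(client, "dc9")) // error reported by the node itself
}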