Orange-OpenSource/casskop

Pod is stuck in Pending when resources are exhausted


I've created a Cassandra cluster with the following configuration:

apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: cassandra-cluster
  labels:
    cluster: k8s.kaas
spec:
  cassandraImage: cassandra:3.11.6
  bootstrapImage: orangeopensource/cassandra-bootstrap:0.1.4
  configMapName: cassandra-configmap-v1
  dataCapacity: "10Gi"
  dataStorageClass: ""
  imagepullpolicy: IfNotPresent  
  hardAntiAffinity: false
  deletePVC: true
  autoPilot: true
  gcStdout: true
  autoUpdateSeedList: true
  maxPodUnavailable: 1
  resources:         
    requests:
      cpu: '1000m'
      memory: '2Gi'
    limits:
      cpu: '1000m'
      memory: '2Gi'
  topology:
    dc:
      - name: dc1
        nodesPerRacks: 1
        rack:
          - name: rack1
          - name: rack2
          - name: rack3

but only the first rack started, because memory was exhausted on the cluster. The second rack was stuck in "Pending" with an error message saying there are not enough resources.
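For reference, the scheduler's reason can be confirmed from the pending pod's events (pod name taken from the listing further down; the exact event text depends on the cluster):

# Show the Pending pod's events; the scheduler reports why it cannot be placed
kubectl describe pod cassandra-cluster-dc1-rack2-0

# Or filter recent events for that pod only
kubectl get events --field-selector involvedObject.name=cassandra-cluster-dc1-rack2-0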

Now I've modified the cluster definition and reduced the requests and limits:

  resources:
    requests:
      cpu: '500m'
      memory: '512Mi'
    limits:
      cpu: '1000m'
      memory: '1024Mi'

But after applying these changes with kubectl, the cluster is still stuck with the second rack in Pending state:

cassandra-cluster-dc1-rack1-0                 1/1     Running   0          3h37m
cassandra-cluster-dc1-rack2-0                 0/1     Pending   0          3h35m
casskop-cassandra-operator-5856b56ccd-bjb85   1/1     Running   0          33h

Logs say:

time="2020-04-05T19:00:59Z" level=info msg="We will request : cassandra-cluster-dc1-rack1-0.cassandra-cluster to catch hostIdMap" cluster=cassandra-cluster err="<nil>"
time="2020-04-05T19:00:59Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=cassandra-cluster dc-rack=dc1-rack1
time="2020-04-05T19:01:01Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=cassandra-cluster dc-rack=dc1-rack1
time="2020-04-05T19:01:01Z" level=info msg="[cassandra-cluster][dc1-rack2]: Initializing StatefulSet: Replicas Number Not OK: 1 on 1, ready[0]"
time="2020-04-05T19:01:01Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=cassandra-cluster dc-rack=dc1-rack2
time="2020-04-05T19:01:01Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=cassandra-cluster dc-rack

In particular, the limits and requests on the first rack have not been changed:

    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:      1
      memory:   2Gi

Hi, this kind of state is not managed by the operator. From the operator's point of view the statefulset is performing something (and the operator doesn't know what), so before applying any change to the statefulset, in this case updating limits and requests, it waits for the statefulset to be successfully running, which in your case will never happen.

To work around this situation, you have to manually delete the statefulset:
kubectl delete statefulset cassandra-cluster-dc1-rack2. After this, the operator will recreate the statefulset with the new requests and limits.

Note: this operation will not remove data, because the persistent volumes (and so your data) will not be removed. I think the only thing you should have to do would be a rebuild of the rack.
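Roughly, the sequence looks like this (names taken from your listing; the repair at the end is just a hedged suggestion for bringing the rack back in sync, adapt it to your topology):

# Delete only the stuck statefulset; the operator recreates it with the new requests/limits
kubectl delete statefulset cassandra-cluster-dc1-rack2

# The PVCs created by the statefulset are left in place, so the data survives
kubectl get pvc

# Once the new pod is Running, re-sync the rack's data, e.g. with a repair
kubectl exec cassandra-cluster-dc1-rack2-0 -- nodetool repair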

@fdehay, @cscetbon should this kind of issue be something that the operator manages automatically?

Couldn't the operator check the reason the pod is not ready (i.e. not enough resources) and apply the changes to the resource limits? I feel this situation can arise even after the cluster is created (e.g. scaling up or reducing the number of nodes without verifying limits).

Also, are you sure that the PVC won't be removed? There is a deletePVC: true config option.

This field allows you to specify whether or not to delete the PVCs when the CassandraCluster resource is removed (have a look at the documentation here).

When you remove a StatefulSet, this doesn't remove the associated PVCs. If you check the StatefulSet documentation https://kubernetes.io/docs/tasks/run-application/delete-stateful-set/#complete-deletion-of-a-statefulset (you can check the previous section too for more explanation), you can see that the PVCs must be removed manually.
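For example, after deleting the rack2 statefulset the PVC stays around until you remove it yourself (the data- prefix comes from the volume claim template name, so verify the exact PVC name with the first command before deleting anything):

# PVCs for the rack survive the statefulset deletion
kubectl get pvc | grep dc1-rack2

# Only if you really want the data gone, remove the claim by hand
kubectl delete pvc data-cassandra-cluster-dc1-rack2-0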

It's a good question. I feel like users should not have to deal with K8s but only with the operator. We could accept a PR to circumvent such an issue in limited cases like "no memory resources". At the moment we use a PodDisruptionBudget to know whether K8s is taking care of a task or whether it's on the operator. We should probably try to flesh out the details if we want to do otherwise.

We discussed the other day how to prevent the user from ending up in front of a stuck statefulset. Back then it was about a non-existing storage class; here, should we test the memory available?
In my opinion the user should know how much memory they have left before requesting a cluster, just as they should know the storage classes and the vCPUs available.
But we could improve the documentation and describe what to do in such a case (pending statefulset).
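For instance, the allocatable memory/CPU per node and the available storage classes can be checked with plain kubectl before sizing the cluster:

# Allocatable capacity and currently requested resources, per node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Current usage, if metrics-server is installed
kubectl top nodes

# Storage classes available in the cluster
kubectl get storageclass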

Shouldn't the operator at least refuse to provision the statefulset if it is known that there won't be enough resources to fully provision it?

@sheerun it's Kubernetes, via the statefulset, that refuses to perform the operation when there are not enough resources, so the operator cannot know about it before it happens; and because it's the statefulset that is blocked in a waiting state, the operator simply waits for it as well.

Meanwhile, we have introduced a special parameter to deal with this: unlockNextOperation. This parameter is used only once by the operator (each time it uses it, it removes it from the spec), and it allows the operator to push a new version of a statefulset even if the statefulset is not in a ready state.

Depending on the actual state of your cluster you may need to add it more than once, but it allows you to roll back a configuration that won't work.

This is documented here: https://github.com/Orange-OpenSource/casskop/blob/aa67cd3d2cd6db9833a25f856a4b2f9d30e8d454/documentation/troubleshooting.md#operator-cant-perform-the-action
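A minimal sketch of what that looks like on the resource (field name and placement as described in the troubleshooting doc above; everything else in the spec stays unchanged):

apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: cassandra-cluster
spec:
  # consumed by the operator on its next reconcile, then removed from the spec
  unlockNextOperation: true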

@erdrix anything to add here? I know you were supposed to try the dry-run feature available at the API level. Does it help, or is it the same as the dry-run option of kubectl?