GoogleCloudPlatform/kubernetes-engine-samples

Problems while using GKE Autopilot+ASM for blue green deployment

jingliangjl opened this issue · 4 comments

Hi, we have encountered some problems while using GKE Autopilot+ASM for Blue/Green deployment, hope we can get some help or advice, thank you.

Our cluster details are as follows:
Our cluster is in GKE autopilot mode and uses Google's ASM and cert manager.
Currently, there are four main workloads in the cluster, and each deployment has two replicas:
a front-end blue app
a back-end blue app
a front-end green app
a back-end green app
Here are the resource settings in the frontend and backend deployment.yaml files:

# front-deployment.yaml
resources:
  limits:
    cpu: 250m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 512Mi

# backend-deployment.yaml
resources:
  limits:
    cpu: 500m
    memory: 2048Mi
  requests:
    cpu: 500m
    memory: 2048Mi

# istio-proxy resources in every pod
resources:
  limits:
    cpu: 500m
    ephemeral-storage: 1Gi
    memory: 512Mi
  requests:
    cpu: 500m
    ephemeral-storage: 1Gi
    memory: 512Mi

The problem:
When we use helm upgrade to update one version, such as the blue version, the blue front-end and back-end pods show Running status after the update succeeds.
But after a few minutes, we find that all the pods on a given node are terminated and rescheduled onto new nodes. These terminated pods may include both blue and green versions of the app, which is not what we want, because this termination makes our service unavailable.

We have read the Autopilot documentation repeatedly and tried the following solutions, but none of them worked:

  1. Add an affinity configuration, but Autopilot itself has limitations on affinity.
  2. Set a RollingUpdate strategy, where maxUnavailable is 0 and maxSurge is 2 (we are going to try setting maxSurge to 1).
  3. Add a priorityClass to all app deployments and set preemptionPolicy to Never, but pods are still evicted. (A rough sketch of attempts 2 and 3 is shown after this list.)
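Roughly what we mean by attempts 2 and 3, as a minimal sketch (the names, labels, and image are placeholders, not our real manifests):

# Hypothetical PriorityClass that never preempts other pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-no-preempt        # placeholder name
value: 1000000
preemptionPolicy: Never
globalDefault: false
---
# Hypothetical Deployment using the PriorityClass and a surge-only rolling update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-blue         # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
      version: blue
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0       # never take an old pod down before a replacement is ready
      maxSurge: 1             # we plan to lower this from 2 to 1
  template:
    metadata:
      labels:
        app: frontend
        version: blue
    spec:
      priorityClassName: app-no-preempt
      containers:
      - name: frontend
        image: frontend:blue  # placeholder image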

We will also check later whether our resources are configured correctly, and try configuring the Autoscaler and a Pod Disruption Budget, but we are not sure whether that is the right approach.

Do you have any suggestions, or could you explain how Autopilot does its scheduling, so that we can achieve a stable blue/green deployment?

Hi @jingliangjl, thank you for your question. I notice you've closed the issue. Is it because you've figured out a solution to your problem? If so, we'd like to hear what went wrong so we can properly document this on our end!

AFAIK, Autopilot can recreate (bigger) nodes to account for increased resource demand and may shift workloads over, but it should respect concepts such as Pod anti-affinity and disruption budgets.

Let us know if you need any more help.

Hi

Thank you for the reply.

We noticed that the VPA documentation mentions "To limit the amount of Pod restarts, use a Pod disruption budget.", so we configured a PDB to ensure that at least one pod in each deployment stays available while the autoscaler is working. This seems to meet our needs so far; a sketch of what we mean is below.
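A minimal sketch of such a PDB (the name and labels are placeholders for our real ones):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-blue-pdb     # placeholder name
spec:
  minAvailable: 1             # keep at least one pod of this deployment available during disruptions
  selector:
    matchLabels:
      app: frontend           # placeholder labels matching the deployment's pods
      version: blue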

However, we have a question about pod anti-affinity. The Autopilot overview documentation mentions a restriction on the topology keys allowed for pod affinity. Does the same restriction also exist for pod anti-affinity?

I believe that is the case: since you don't have control over the node labels in Autopilot, rather than arbitrary topology keys you're given a list of legal keys to use, namely topology.kubernetes.io/region, topology.kubernetes.io/zone, failure-domain.beta.kubernetes.io/region, and failure-domain.beta.kubernetes.io/zone.
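For example, a rough sketch of spreading a deployment's pods across zones using one of those allowed keys (the labels are placeholders):

# In the Deployment's pod template spec
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: frontend     # placeholder labels
            version: blue
        topologyKey: topology.kubernetes.io/zone   # one of the allowed keys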

You may also find some more information on affinity and anti-affinity here: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity

I hope that helps!

I have no more questions, thank you so much for the help, really appreciated it!