GoogleContainerTools/skaffold

GKE LoadBalancer doesn't work with service deployed by Skaffold

thesandlord opened this issue · 15 comments

I have a service with type: LoadBalancer that I deploy with Skaffold. The service is created fine and the load balancer shows as healthy on the GCP console, but when I run kubectl get svc the external IP address never gets resolved and is stuck in <pending>. Everything works if I deploy the same service using kubectl apply.

I actually have this on video as well: https://youtu.be/JUFIF9QMN9M?t=1630

This has happened multiple times with multiple clusters, projects, and services. @ahmetb is experiencing the same issue as well.

Right now, I'm thinking there is something Skaffold does to the service (labels?) which is preventing the service from getting the external IP address.

Information

  • Skaffold version: v0.11.0
  • Operating system: Linux
  • Contents of skaffold.yaml:

Service YAML

apiVersion: v1
kind: Service
metadata:
  name: uptimecheck
  labels:
    app: uptimecheck
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 3000
    protocol: TCP
    name: http
  selector:
    app: "uptimecheck"

Skaffold YAML

apiVersion: skaffold/v1alpha2
kind: Config
build:
  artifacts:
  - imageName: gcr.io/xxx/xxx
deploy:
  kubectl:
    manifests:
      - svc.yaml

Steps to reproduce the behavior

skaffold dev

I am seeing the same.

Unless I use a static IP, a Service with type=LoadBalancer never gets an IP on a vanilla GKE cluster:

  • if I go to Google Cloud Console, I see an IP for the LB
  • but the IP is actually not associated with the LB on Kubernetes API
  • overall, hitting the IP doesn't work even though it shows up on the UI

I know at least one more person who deployed the https://github.com/GoogleCloudPlatform/microservices-demo/ and reproed it. So that might be the easiest repro available in open source.

YAML:

apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"frontend-external","namespace":"default"},"spec":{"ports":[{"name":"http","port":80,"targetPort":8080}],"selector":{"app":"frontend"},"type":"LoadBalancer"}}
  creationTimestamp: 2018-07-17T19:12:13Z
  labels:
    cleanup: "true"
    deployed-with: skaffold
    docker-api-version: "1.38"
    skaffold-builder: local
    skaffold-deployer: kubectl
    skaffold-tag-policy: git-commit
  name: frontend-external
  namespace: default
  resourceVersion: "4524845"
  selfLink: /api/v1/namespaces/default/services/frontend-external
  uid: 50092bab-89f5-11e8-a2bb-42010a80009c
spec:
  clusterIP: 10.19.250.58
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 30751
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: frontend
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer: {}

describe output:

Name:                     frontend-external
Namespace:                default
Labels:                   cleanup=true
                          deployed-with=skaffold
                          docker-api-version=1.38
                          skaffold-builder=local
                          skaffold-deployer=kubectl
                          skaffold-tag-policy=git-commit
Annotations:              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"frontend-external","namespace":"default"},"spec":{"ports":[{"name":"http","por...
Selector:                 app=frontend
Type:                     LoadBalancer
IP:                       10.19.250.58
Port:                     http  80/TCP
TargetPort:               8080/TCP
NodePort:                 http  30751/TCP
Endpoints:                10.16.2.99:8080
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
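
The empty `status.loadBalancer: {}` in the YAML above is the state that `kubectl get svc` renders as `<pending>`. A minimal sketch of checking for it programmatically, assuming the Service has already been fetched as JSON (for example with `kubectl get svc frontend-external -o json`) and parsed into a dict. The function name `external_ip_pending` is hypothetical, not part of Skaffold or kubectl:

```python
def external_ip_pending(service: dict) -> bool:
    """Return True if a LoadBalancer Service has no external address yet."""
    if service.get("spec", {}).get("type") != "LoadBalancer":
        return False  # only LoadBalancer Services are assigned an external address
    # Once provisioned, status.loadBalancer.ingress is a non-empty list of
    # entries carrying an "ip" or "hostname" field.
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress")
    return not ingress


# Trimmed-down version of the stuck Service shown above.
stuck = {
    "spec": {"type": "LoadBalancer"},
    "status": {"loadBalancer": {}},
}
print(external_ip_pending(stuck))  # True
```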

I got another person to repro this too.

Progress debugging this: if I run skaffold delete, wait 5 minutes (so the underlying GCE networking resources are deleted), and redeploy with skaffold run, I can repro this 100% of the time.

+Bonus: if I do kubectl get -o=yaml service/frontend-external | kubectl apply -f- which causes a "re-apply", then it gets the EXTERNAL-IP right away.
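
For anyone scripting this workaround, here is a rough sketch of the same "re-apply" pipeline driven from Python. It assumes `kubectl` is on the PATH and pointed at the affected cluster. The helper name `reapply_service` and its injectable `run` parameter are hypothetical and exist only to keep the sketch testable:

```python
import subprocess
from typing import Callable


def reapply_service(name: str, run: Callable = subprocess.run) -> None:
    """Pipe `kubectl get -o=yaml service/<name>` into `kubectl apply -f-`."""
    # Fetch the live Service manifest as YAML.
    fetched = run(
        ["kubectl", "get", "-o=yaml", f"service/{name}"],
        capture_output=True, text=True, check=True,
    )
    # Feed the manifest back on stdin, which causes the "re-apply".
    run(["kubectl", "apply", "-f-"], input=fetched.stdout, text=True, check=True)
```

Calling `reapply_service("frontend-external")` runs the two commands in sequence and raises `subprocess.CalledProcessError` if either one fails.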

I confirm that this is an issue with how we do the labelling: we update the labels immediately after the service is deployed, which then confuses the load balancer.


AWESOME! Thanks @balopat .

I was seeing the last-applied-configuration even on a clean skaffold run which got me thinking whether skaffold is applying things twice.

Then I thought "I guess this annotation just exists when you deploy things with kubectl-apply". I shouldn't have thought that. Well at least now we know what to fix. 🥇

An update: we are thinking about how to get around the labelling issue. Some of the crappy alternatives that came up are:

  1. don't label services at all (works, but not ideal as it's inconsistent)
  2. label loadbalancer services only after external ip is assigned (there might be other issues preventing the assignment)
  3. label loadbalancer services after a certain timeout (e.g. 2 minutes is mostly good for GKE)
  4. maybe 2 with a timeout and then 3 combined?
  5. look again deeper into the design of labelling and rethink it (needs more time)
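
To make option 4 concrete, here is a minimal sketch of "wait for the external IP, but label anyway after a timeout". It is not how Skaffold implements anything. The function name and the `has_external_ip` / `apply_labels` callbacks are hypothetical stand-ins for whatever would query the Service and patch its labels, and the 120-second default only mirrors the "2 minutes" guess above:

```python
import time
from typing import Callable


def label_after_external_ip(
    has_external_ip: Callable[[], bool],
    apply_labels: Callable[[], None],
    timeout_s: float = 120.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> bool:
    """Label once the external IP appears, or after the timeout as a fallback.

    Returns True if the IP was seen before labelling, False if the timeout fired.
    """
    deadline = now() + timeout_s
    while now() < deadline:
        if has_external_ip():
            apply_labels()  # option 2: the IP is assigned, so labelling is safe
            return True
        sleep(poll_interval_s)
    apply_labels()  # option 3: give up waiting and label anyway
    return False
```

The `sleep` and `now` parameters are only there so the loop can be exercised with a fake clock. In real use the defaults would apply, and `has_external_ip` would check whether `status.loadBalancer.ingress` is populated on the Service.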

I think ideally this should be fixed in Kubernetes core. The service controller should not be easily confused and get stuck. If you have a reliable repro, please open an issue to kubernetes/kubernetes.

I don't think this is Kubernetes core specific; this looks like a GKE LoadBalancer-specific issue. I will open an issue with them though.

repro is super easy:

export app=mysvc; kubectl run $app --image nginx && kubectl expose deployment/$app --port 80 --type LoadBalancer && kubectl edit svc/$app

add a label in the edit command and you'll get the same issue

> Kubernetes core specific

The service controller (+cloudprovider support) is in Kubernetes core (https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/service and https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/gce/gce_loadbalancer_external.go), therefore I recommend opening a GitHub issue. (:

Just wanted to throw in a little +1 on this, experiencing the same issue

It's fixed in kubernetes/kubernetes#68087, but it's currently not picked into any of the 1.12 releases.

Since this is in GKE master and GKE tends to pick up the new k8s versions through a long vetting process (i.e. today the default gke version is 1.9.7, and k8s just released 1.12.0-beta.1), it's unlikely that this will be fixed in GKE in the next 3 months.

It might be worth considering patching this somehow in Skaffold for the short term.

@tejal29 this will be solved by switching over to helm template as well if we reopen/rebase #2105

I believe this is fixed with #2568 - I'm not able to reproduce this locally on the latest version (v0.34.0). @thesandlord @ahmetb @balopat could one of you test out and make sure it's working for you as well?

confirmed, this should work now!