Error syncing load balancer: failed to ensure load balancer: instance not found

Question

Stringls opened this issue a year ago · 5 comments

/kind bug

What steps did you take and what happened:

I am not able to set up ingress-nginx on a GCE cluster following this tutorial: https://kubernetes.github.io/ingress-nginx/deploy/#gce-gke
The pods are running, but the LB Service stays in the Pending state. In the events section I see the error below:

Error syncing load balancer: failed to ensure load balancer: instance not found

What did you expect to happen:

Ingress NGINX by Kubernetes is set up properly, and the LB in GCP is created and running.

Anything else you would like to add:

Strangely, I was previously able to deploy an LB in GCP using the same tutorial. If I am not mistaken, the configuration was the same; I haven't changed anything. When I try to replicate that scenario now, I get the error above.

Environment:

mgmt cluster:

Cluster-api version: 1.4.1
CAPI GCP version: 1.3.0
Kubernetes version: (use kubectl version): 1.26.5
OS (e.g. from /etc/os-release): GKE's VM image

workload (GCE) cluster:

Kubernetes version: 1.25.4
OS: ubuntu-20.04

The image for the GCE cluster was built following your tutorial, with the Kubernetes version changed to 1.25.4.

mloiseleur commented a year ago

/close

Answer 1 · 2023-08-08T14:14:25.000Z

I encountered the same issue with Traefik Ingress. Using the same code, I have two clusters (one created 60 days ago and one recent). The first has an LB without any issue.
The other uses the same YAML code, is more recent, and cannot find the instance. It's reproducible, even after I drop the cluster and create it again.

Kube-controller indicates that:

E0808 13:56:29.068892       1 gce_instances.go:633] Failed to retrieve instances: [xxx-4278439806-7vbnx xxx-4278439806-df7kg xxx-4278439806-v25b5]
E0808 13:56:29.068935       1 gce_loadbalancer.go:174] Failed to EnsureLoadBalancer(xxx, traefik, traefik, xxxx, europe-west4), err: instance not found
E0808 13:56:29.068977       1 controller.go:320] error processing service traefik/traefik (will retry): failed to ensure load balancer: instance not found

This seems really weird because:

kubectl get nodes works as expected and lists the expected nodes
kubectl get pods -l (with the Service's selector as the filter) works as expected and lists the expected pods

Answer 2 · 2023-08-09T13:31:02.000Z

@Stringls I found it (!)

It was working on the first cluster because the KubeadmControlPlane and the workers were in the same zone.
It was not working on the second cluster because the control plane node was launched in europe-west4-a and the worker nodes in europe-west4-b.

Forcing the control plane into the same zone as the workers, with spec.failureDomains on the GCPCluster, fixed the issue!
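
For anyone landing here later, this is a minimal sketch of what pinning the zone on the GCPCluster can look like. It is not the exact manifest from my cluster: the resource name, project ID and network name are hypothetical placeholders, and the apiVersion assumes the v1beta1 API of Cluster API Provider GCP. Only the region and zone come from the case described above.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: GCPCluster
metadata:
  name: my-cluster              # hypothetical name
spec:
  project: my-gcp-project       # hypothetical GCP project ID
  region: europe-west4
  network:
    name: default               # assumption: adjust to your VPC
  # Restrict the cluster to a single zone so the control plane
  # lands in the same zone as the worker nodes.
  failureDomains:
    - europe-west4-b
```

If failureDomains is left empty, the provider may spread machines across all zones of the region, which is how the control plane and the workers can end up in different zones.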

Answer 3 · 2023-08-09T13:31:22.000Z

@mloiseleur: Closing this issue.

In response to this:

/close

Answer 4 · 2023-08-09T13:33:24.000Z

@mloiseleur Thank you very much for posting a solution. Have a good one!