aws-load-balancer-controller v2.7.0 : no endpoints available for service aws-load-balancer-webhook-service
dev-travelex opened this issue · 5 comments
Describe the bug
Trying to deploy the Ingress controller via the Helm chart, but getting this error:
Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
Logs from ALB controller:
{"level":"info","ts":"2024-02-09T12:55:48Z","msg":"version","GitVersion":"v2.7.0","GitCommit":"ed00c8199179b04fa2749db64e36d4a48ab170ac","BuildDate":"2024-02-01T01:05:49+0000"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"setup","msg":"adding health check for controller"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"setup","msg":"adding readiness check for webhook"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-v1-pod"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-v1-service"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-elbv2-k8s-aws-v1beta1-ingressclassparams"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-networking-v1-ingress"}
{"level":"info","ts":"2024-02-09T12:55:48Z","logger":"setup","msg":"starting podInfo repo"}
{"level":"info","ts":"2024-02-09T12:55:50Z","logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
{"level":"info","ts":"2024-02-09T12:55:50Z","msg":"Starting server","kind":"health probe","addr":"[::]:61779"}
{"level":"info","ts":"2024-02-09T12:55:50Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":"2024-02-09T12:55:50Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2024-02-09T12:55:50Z","logger":"controller-runtime.webhook","msg":"Serving webhook server","host":"","port":9443}
{"level":"info","ts":"2024-02-09T12:55:50Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
I0209 12:55:50.353502 1 leaderelection.go:248] attempting to acquire leader lease kube-system/aws-load-balancer-controller-leader...
Steps to reproduce
Deployed via helm chart
resource "helm_release" "aws-load-balancer-controller" {
  name       = "aws-load-balancer-controller"
  depends_on = [null_resource.post-policy, kubernetes_service_account.alb_controller]
  atomic     = true
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"
  timeout    = 600

  set {
    name  = "clusterName"
    value = data.aws_eks_cluster.cluster.name
  }

  set {
    name  = "serviceAccount.create"
    value = false
  }

  set {
    name  = "serviceAccount.name"
    value = "aws-load-balancer-controller"
  }

  set {
    name  = "vpcId"
    value = data.aws_vpc.best.id
  }
}
Expected outcome
Environment
- Chart name: aws-load-balancer-controller
- Chart version: 2.7.0
- Kubernetes version:
- Using EKS (yes/no), if so version? 1.28
Additional Context:
@dev-travelex
Would you help run
kubectl -n kube-system describe deployment/aws-load-balancer-controller
and
kubectl -n kube-system describe endpoints/aws-load-balancer-webhook-service
I went back to 2.6.2 after getting that error, then deployed 2.7.0 again after seeing your comment. This time, I didn't get the webhook error. Nonetheless, here are the logs:
kubectl -n kube-system describe deployment/aws-load-balancer-controller
Name: aws-load-balancer-controller
Namespace: kube-system
CreationTimestamp: Sat, 10 Feb 2024 01:20:22 +0000
Labels: app.kubernetes.io/instance=aws-load-balancer-controller
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=aws-load-balancer-controller
app.kubernetes.io/version=v2.7.0
helm.sh/chart=aws-load-balancer-controller-1.7.0
Annotations: deployment.kubernetes.io/revision: 2
meta.helm.sh/release-name: aws-load-balancer-controller
meta.helm.sh/release-namespace: kube-system
Selector: app.kubernetes.io/instance=aws-load-balancer-controller,app.kubernetes.io/name=aws-load-balancer-controller
Replicas: 2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app.kubernetes.io/instance=aws-load-balancer-controller
app.kubernetes.io/name=aws-load-balancer-controller
Annotations: prometheus.io/port: 8080
prometheus.io/scrape: true
Service Account: aws-load-balancer-controller
Containers:
aws-load-balancer-controller:
Image: public.ecr.aws/eks/aws-load-balancer-controller:v2.7.0
Ports: 9443/TCP, 8080/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--cluster-name=nonprod-bliss-eks
--ingress-class=alb
--aws-vpc-id=vpc-0c6f43c09d80645f7
Liveness: http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
Readiness: http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
Environment: <none>
Mounts:
/tmp/k8s-webhook-server/serving-certs from cert (ro)
Volumes:
cert:
Type: Secret (a volume populated by a Secret)
SecretName: aws-load-balancer-tls
Optional: false
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: aws-load-balancer-controller-6d59ff6457 (0/0 replicas created)
NewReplicaSet: aws-load-balancer-controller-b55586fb8 (2/2 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 22m deployment-controller Scaled up replica set aws-load-balancer-controller-6d59ff6457 to 2
Normal ScalingReplicaSet 6m51s deployment-controller Scaled up replica set aws-load-balancer-controller-b55586fb8 to 1
Normal ScalingReplicaSet 5m55s deployment-controller Scaled down replica set aws-load-balancer-controller-6d59ff6457 to 1 from 2
Normal ScalingReplicaSet 5m55s deployment-controller Scaled up replica set aws-load-balancer-controller-b55586fb8 to 2 from 1
Normal ScalingReplicaSet 5m3s deployment-controller Scaled down replica set aws-load-balancer-controller-6d59ff6457 to 0 from 1
kubectl -n kube-system describe endpoints/aws-load-balancer-webhook-service
Name: aws-load-balancer-webhook-service
Namespace: kube-system
Labels: app.kubernetes.io/component=webhook
app.kubernetes.io/instance=aws-load-balancer-controller
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=aws-load-balancer-controller
app.kubernetes.io/version=v2.7.0
helm.sh/chart=aws-load-balancer-controller-1.7.0
prometheus.io/service-monitor=false
Annotations: <none>
Subsets:
Addresses: 100.68.90.208,100.68.90.24
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
webhook-server 9443 TCP
The logs show it has been stuck at this line for more than 24 hours now. Would it impact the actual deployment of a load balancer?
I0210 23:42:56.770284 1 leaderelection.go:248] attempting to acquire leader lease kube-system/aws-load-balancer-controller-leader...
I have this same issue with releases of the aws-load-balancer-controller chart >= 1.7. I pinned the load balancer controller pod image to v2.4.2 in both cases, so this is likely a chart-level manifest/configuration issue imo.

Solution 1: downgrading the chart to 1.6.2 fixed it for me.

Solution 2: upgrading the image.tag argument that pins the load balancer controller image to >=2.7, while keeping the chart version >=1.7, also works.

Solution 3 (untested): setting the readinessProbe helm chart value to empty might also work (see below for why).
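For anyone deploying the chart through Terraform like the reporter above, solution 2 can be sketched as an extra `set` block on the `helm_release` resource. This is a sketch, not a verified fix: it assumes the same resource from the reproduction, and pins both the chart `version` and the chart's `image.tag` value to a matching pair:

```hcl
resource "helm_release" "aws-load-balancer-controller" {
  # ... same name/repository/chart/namespace settings as in the reproduction above ...

  version = "1.7.0" # chart version >= 1.7

  # Pin the controller image to a version that serves the /readyz
  # endpoint the 1.7.x chart's readiness probe expects.
  set {
    name  = "image.tag"
    value = "v2.7.0"
  }
}
```

The point is simply to keep the (chart, image) pair inside the compatible combinations described further down in this thread.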
A bit more detail on the problem description (for my case) and why the solutions work:
It seems like the pods belonging to the load balancer controller deployment aren't coming up: they are failing their readiness probes in charts >= 1.7:
NAME READY STATUS RESTARTS AGE
aws-load-balancer-controller-9b866d9c-4n665 0/1 Running 0 6m36s
aws-load-balancer-controller-9b866d9c-fxzsz 0/1 Running 0 6m36s
Pod logs
{"level":"info","ts":1713346839.8052669,"msg":"version","GitVersion":"v2.4.2","GitCommit":"77370be7f8e13787a3ec0cfa99de1647010f1055","BuildDate":"2022-05-24T22:33:27+0000"}
{"level":"info","ts":1713346839.862494,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1713346839.8755138,"logger":"setup","msg":"adding health check for controller"}
{"level":"info","ts":1713346839.8758347,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-v1-pod"}
{"level":"info","ts":1713346839.876017,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
{"level":"info","ts":1713346839.8762178,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
{"level":"info","ts":1713346839.8763244,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-networking-v1-ingress"}
{"level":"info","ts":1713346839.876444,"logger":"setup","msg":"starting podInfo repo"}
I0417 09:40:41.877204 1 leaderelection.go:243] attempting to acquire leader lease kube-system/aws-load-balancer-controller-leader...
{"level":"info","ts":1713346841.877265,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1713346841.8773456,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":1713346841.877691,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1713346841.877793,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":9443}
{"level":"info","ts":1713346841.8782332,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
load balancer controller pod description:
Name: aws-load-balancer-controller-9b866d9c-4n665
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Service Account: aws-load-balancer-controller
Node: ip-10-0-60-11.eu-west-2.compute.internal/10.0.60.11
Start Time: Wed, 17 Apr 2024 09:40:39 +0000
Labels: app.kubernetes.io/instance=aws-load-balancer-controller
app.kubernetes.io/name=aws-load-balancer-controller
pod-template-hash=9b866d9c
Annotations: prometheus.io/port: 8080
prometheus.io/scrape: true
Status: Running
IP: 10.0.43.232
IPs:
IP: 10.0.43.232
Controlled By: ReplicaSet/aws-load-balancer-controller-9b866d9c
Containers:
aws-load-balancer-controller:
Container ID: containerd://e42b03eede8cd18a7912b80e9779c638b5817812eb4b9788800f58636d6d2135
Image: public.ecr.aws/eks/aws-load-balancer-controller:v2.4.2
Image ID: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller@sha256:321e6ff4b55a0bb8afc090cd2f7ef5b0be8cd356e407ce525ee8b24866382808
Ports: 9443/TCP, 8080/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--cluster-name=mlops-demo
--ingress-class=alb
State: Running
Started: Wed, 17 Apr 2024 09:40:39 +0000
Ready: False
Restart Count: 0
Liveness: http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
Readiness: http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
Environment:
AWS_STS_REGIONAL_ENDPOINTS: regional
AWS_DEFAULT_REGION: eu-west-2
AWS_REGION: eu-west-2
AWS_ROLE_ARN: arn:aws:iam::743582000746:role/aws-load-balancer-controller
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/tmp/k8s-webhook-server/serving-certs from cert (ro)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2d6k5 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
aws-iam-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 86400
cert:
Type: Secret (a volume populated by a Secret)
SecretName: aws-load-balancer-tls
Optional: false
kube-api-access-2d6k5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned kube-system/aws-load-balancer-controller-9b866d9c-4n665 to ip-10-0-60-11.eu-west-2.compute.internal
Normal Pulled 10m kubelet Container image "public.ecr.aws/eks/aws-load-balancer-controller:v2.4.2" already present on machine
Normal Created 10m kubelet Created container aws-load-balancer-controller
Normal Started 10m kubelet Started container aws-load-balancer-controller
Warning Unhealthy 4m54s (x35 over 9m54s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 404
As a result, no services (regardless of load balancer type) can be created when trying to apply either ingress or generic service manifests (as expected, it seems, and as outlined in this issue).
kubectl apply -f eks-cluster-w-alb/manifests/alb-example/ # deploy remaining stack
deployment.apps/echoserver unchanged
namespace/alb-example unchanged
Error from server (InternalError): error when creating "eks-cluster-w-alb/manifests/alb-example/ingress.yaml": Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1-ingress?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
Error from server (InternalError): error when creating "eks-cluster-w-alb/manifests/alb-example/service.yaml": Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
Error from server (InternalError): error when creating "eks-cluster-w-alb/manifests/alb-example/test-service.yaml": Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
I've noticed this change introducing a readiness probe for the controller docker image, made in what looks like the release of chart 1.7.0.
This readiness probe seems to only be supported by controller images >=2.7. In other words, the following combinations of (chart version, controller image version) seem to be compatible:
- (<1.7, *)
- (>=1.7, >=2.7)
- you might be able to use <2.7 controller image versions if you disable the readinessProbe helm value, removing the unsupported probe endpoint for older images
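That last workaround could be sketched in Terraform like this (untested, as noted above; it assumes Helm's `--set key=null` convention for clearing a chart default propagates through the provider, so check the chart's values.yaml before relying on it):

```hcl
  # (untested) null out the chart's default readinessProbe so
  # older (<2.7) controller images aren't probed on /readyz
  set {
    name  = "readinessProbe"
    value = "null"
  }
```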
I encountered the same issue. After reviewing @SebastianScherer88's solution, I realized I was on a chart >= 1.7 with a 2.6 image. I upgraded the image to version 2.7, and it resolved the problem. Thanks!