Container stuck in CrashLoopBackOff when deployed in ap-southeast-5
ridzuan5757 opened this issue · 4 comments
Describe the bug
aws-node-termination-handler is stuck in CrashLoopBackOff when deployed in the AWS Malaysia region (ap-southeast-5).
Steps to reproduce
Kubernetes is deployed with kOps using the following commands:
kops create cluster --node-count 3 --control-plane-count 3 --control-plane-size t3.medium --node-size t3.medium --control-plane-zones ap-southeast-5a --zones ap-southeast-5a,ap-southeast-5b,ap-southeast-5c
kops update cluster --yes --admin
Expected outcome
Containers run normally, as they do when deployed in other regions.
Application Logs
This is the output of kubectl describe pod:
Name: aws-node-termination-handler-7d56b6d497-5qp92
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Service Account: aws-node-termination-handler
Node: i-06965db02543d103c/172.20.1.196
Start Time: Sun, 15 Sep 2024 04:48:46 +0800
Labels: app.kubernetes.io/component=deployment
app.kubernetes.io/instance=aws-node-termination-handler
app.kubernetes.io/name=aws-node-termination-handler
k8s-app=aws-node-termination-handler
kops.k8s.io/managed-by=kops
kops.k8s.io/nth-mode=sqs
kubernetes.io/os=linux
pod-template-hash=7d56b6d497
Annotations: <none>
Status: Running
IP: 172.20.1.196
IPs:
IP: 172.20.1.196
Controlled By: ReplicaSet/aws-node-termination-handler-7d56b6d497
Containers:
aws-node-termination-handler:
Container ID: containerd://f80173b633fd5d2d1fc1cf30efdd959b82443b3fadd567439c1bdc98940b16e0
Image: public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5
Image ID: public.ecr.aws/aws-ec2/aws-node-termination-handler@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5
Ports: 8080/TCP, 9092/TCP
Host Ports: 8080/TCP, 9092/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sun, 15 Sep 2024 04:52:08 +0800
Finished: Sun, 15 Sep 2024 04:52:08 +0800
Ready: False
Restart Count: 5
Requests:
cpu: 50m
memory: 64Mi
Liveness: http-get http://:8080/healthz delay=5s timeout=1s period=5s #success=1 #failure=3
Environment:
NODE_NAME: (v1:spec.nodeName)
POD_NAME: aws-node-termination-handler-7d56b6d497-5qp92 (v1:metadata.name)
NAMESPACE: kube-system (v1:metadata.namespace)
ENABLE_PROBES_SERVER: true
PROBES_SERVER_PORT: 8080
PROBES_SERVER_ENDPOINT: /healthz
LOG_LEVEL: info
JSON_LOGGING: true
LOG_FORMAT_VERSION: 2
ENABLE_PROMETHEUS_SERVER: false
PROMETHEUS_SERVER_PORT: 9092
CHECK_TAG_BEFORE_DRAINING: true
MANAGED_TAG: aws-node-termination-handler/managed
USE_PROVIDER_ID: true
DRY_RUN: false
CORDON_ONLY: false
TAINT_NODE: false
EXCLUDE_FROM_LOAD_BALANCERS: true
DELETE_LOCAL_DATA: true
IGNORE_DAEMON_SETS: true
POD_TERMINATION_GRACE_PERIOD: -1
NODE_TERMINATION_GRACE_PERIOD: 120
EMIT_KUBERNETES_EVENTS: true
COMPLETE_LIFECYCLE_ACTION_DELAY_SECONDS: -1
ENABLE_SQS_TERMINATION_DRAINING: true
QUEUE_URL: https://sqs.ap-southeast-5.amazonaws.com/715841329405/monitoring-shell-ronpos-com-nth
DELETE_SQS_MSG_IF_NODE_NOT_FOUND: false
WORKERS: 10
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-45qzm (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-45qzm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/control-plane op=Exists
node-role.kubernetes.io/master op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints: kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=aws-node-termination-handler,app.kubernetes.io/name=aws-node-termination-handler,kops.k8s.io/nth-mode=sqs
topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/instance=aws-node-termination-handler,app.kubernetes.io/name=aws-node-termination-handler,kops.k8s.io/nth-mode=sqs
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4m6s default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Normal Scheduled 3m27s default-scheduler Successfully assigned kube-system/aws-node-termination-handler-7d56b6d497-5qp92 to i-06965db02543d103c
Normal Pulling 3m27s kubelet Pulling image "public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5"
Normal Pulled 3m9s kubelet Successfully pulled image "public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5" in 16.811s (17.697s including waiting). Image size: 16516861 bytes.
Normal Started 2m20s (x4 over 3m9s) kubelet Started container aws-node-termination-handler
Warning BackOff 108s (x10 over 3m7s) kubelet Back-off restarting failed container aws-node-termination-handler in pod aws-node-termination-handler-7d56b6d497-5qp92_kube-system(0441621b-8f9a-45ca-9d22-4209fd83d2b8)
Normal Created 96s (x5 over 3m9s) kubelet Created container aws-node-termination-handler
Normal Pulled 96s (x4 over 3m8s) kubelet Container image "public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5" already present on machine
This is the output of kubectl logs:
{"level":"info","time":"2024-09-14T20:52:08Z","message":"Using log format version 2"}
{"level":"info","dry_run":false,"node_name":"i-06965db02543d103c","pod_name":"aws-node-termination-handler-7d56b6d497-5qp92","pod_namespace":"kube-system","metadata_url":"http://169.254.169.254","kubernetes_service_host":"100.64.0.1","kubernetes_service_port":"443","delete_local_data":true,"ignore_daemon_sets":true,"pod_termination_grace_period":-1,"node_termination_grace_period":120,"enable_scheduled_event_draining":true,"enable_spot_interruption_draining":true,"enable_sqs_termination_draining":true,"delete_sqs_msg_if_node_not_found":false,"enable_rebalance_monitoring":false,"enable_rebalance_draining":false,"metadata_tries":3,"cordon_only":false,"taint_node":false,"taint_effect":"NoSchedule","exclude_from_load_balancers":true,"json_logging":true,"log_level":"info","webhook_proxy":"","uptime_from_file":"","enable_prometheus_server":false,"prometheus_server_port":9092,"emit_kubernetes_events":true,"kubernetes_events_extra_annotations":"","aws_region":"","aws_endpoint":"","queue_url":"https://sqs.ap-southeast-5.amazonaws.com/715841329405/monitoring-shell-ronpos-com-nth","check_tag_before_draining":true,"ManagedTag":"aws-node-termination-handler/managed","use_provider_id":true,"time":"2024-09-14T20:52:08Z","message":"aws-node-termination-handler arguments"}
{"level":"fatal","time":"2024-09-14T20:52:08Z","message":"Unable to find the AWS region to process queue events."}
Environment
- NTH App Version: v1.22.0
- NTH Mode (IMDS/Queue processor): Queue Processor (SQS)
- OS/Arch: Ubuntu 22.04.4 LTS/Intel
- Kubernetes version: v1.30.2
- Installation method: Kops version 1.30.0
Hi @ridzuan5757, thank you for raising this issue. Unfortunately, NTH v1 does not support ap-southeast-5. We have a separate, unreleased NTH v2 branch that you can try; it should work in that region.
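In the meantime, a possible stopgap (untested in ap-southeast-5, so treat it as an assumption rather than a confirmed fix) is to set the region explicitly instead of letting NTH discover it. The Queue Processor configuration includes an AWS_REGION option, which can be injected into the existing deployment with something like:

kubectl -n kube-system set env deployment/aws-node-termination-handler AWS_REGION=ap-southeast-5

Note that kops manages this addon, so a manual edit may be reconciled away on the next kops update cluster --yes.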
Starting a thread here on what the future fix would be: we would need to update aws-sdk-go to v2, as that has the most up-to-date region information. All the region information provided by aws-sdk-go v1 is outdated (hence the NTH failure in ap-southeast-5). We are unsure when we will be able to get to this issue, as our team has limited bandwidth, but we welcome contributions if anyone is interested in starting on a fix.
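To make the root cause concrete, here is a minimal sketch (not NTH code, just an illustration of the mechanism; whether NTH fails on exactly this lookup is an assumption, though the fatal "Unable to find the AWS region" log above is consistent with it). aws-sdk-go v1 ships its region metadata compiled into the endpoints package, so a binary built against an SDK version that predates a region launch simply has no entry for it:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/endpoints"
)

func main() {
	// aws-sdk-go v1 embeds region metadata in the binary at build time.
	// A region launched after the vendored SDK version (e.g. ap-southeast-5)
	// is missing from this map until the dependency is bumped.
	regions := endpoints.AwsPartition().Regions()
	if _, known := regions["ap-southeast-5"]; !known {
		fmt.Println("ap-southeast-5 is not in this SDK build's region table")
	}
	// aws-sdk-go-v2 carries more current region information, which is why
	// the proposed fix is to migrate NTH to v2.
}
```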