aws/aws-node-termination-handler

Container stuck in CrashLoopBackOff when deployed in ap-southeast-5

ridzuan5757 opened this issue · 4 comments

Describe the bug
aws-node-termination-handler is stuck in CrashLoopBackOff when deployed in the AWS Malaysia region (ap-southeast-5).

Steps to reproduce
The Kubernetes cluster is deployed using Kops with the following commands:

kops create cluster --node-count 3 --control-plane-count 3 --control-plane-size t3.medium --node-size t3.medium --control-plane-zones ap-southeast-5a --zones ap-southeast-5a,ap-southeast-5b,ap-southeast-5c

kops update cluster --yes --admin
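
Once the cluster is up, the failing pods can be observed with the command below (the label selector is taken from the pod labels shown in the describe output further down):

kubectl -n kube-system get pods -l app.kubernetes.io/name=aws-node-termination-handler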

Expected outcome
Containers run normally, as they do when deployed in other regions.

Application Logs
This is the output from kubectl describe pod:

Name:                 aws-node-termination-handler-7d56b6d497-5qp92
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      aws-node-termination-handler
Node:                 i-06965db02543d103c/172.20.1.196
Start Time:           Sun, 15 Sep 2024 04:48:46 +0800
Labels:               app.kubernetes.io/component=deployment
                      app.kubernetes.io/instance=aws-node-termination-handler
                      app.kubernetes.io/name=aws-node-termination-handler
                      k8s-app=aws-node-termination-handler
                      kops.k8s.io/managed-by=kops
                      kops.k8s.io/nth-mode=sqs
                      kubernetes.io/os=linux
                      pod-template-hash=7d56b6d497
Annotations:          <none>
Status:               Running
IP:                   172.20.1.196
IPs:
  IP:           172.20.1.196
Controlled By:  ReplicaSet/aws-node-termination-handler-7d56b6d497
Containers:
  aws-node-termination-handler:
    Container ID:   containerd://f80173b633fd5d2d1fc1cf30efdd959b82443b3fadd567439c1bdc98940b16e0
    Image:          public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5
    Image ID:       public.ecr.aws/aws-ec2/aws-node-termination-handler@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5
    Ports:          8080/TCP, 9092/TCP
    Host Ports:     8080/TCP, 9092/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 15 Sep 2024 04:52:08 +0800
      Finished:     Sun, 15 Sep 2024 04:52:08 +0800
    Ready:          False
    Restart Count:  5
    Requests:
      cpu:     50m
      memory:  64Mi
    Liveness:  http-get http://:8080/healthz delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      NODE_NAME:                                 (v1:spec.nodeName)
      POD_NAME:                                 aws-node-termination-handler-7d56b6d497-5qp92 (v1:metadata.name)
      NAMESPACE:                                kube-system (v1:metadata.namespace)
      ENABLE_PROBES_SERVER:                     true
      PROBES_SERVER_PORT:                       8080
      PROBES_SERVER_ENDPOINT:                   /healthz
      LOG_LEVEL:                                info
      JSON_LOGGING:                             true
      LOG_FORMAT_VERSION:                       2
      ENABLE_PROMETHEUS_SERVER:                 false
      PROMETHEUS_SERVER_PORT:                   9092
      CHECK_TAG_BEFORE_DRAINING:                true
      MANAGED_TAG:                              aws-node-termination-handler/managed
      USE_PROVIDER_ID:                          true
      DRY_RUN:                                  false
      CORDON_ONLY:                              false
      TAINT_NODE:                               false
      EXCLUDE_FROM_LOAD_BALANCERS:              true
      DELETE_LOCAL_DATA:                        true
      IGNORE_DAEMON_SETS:                       true
      POD_TERMINATION_GRACE_PERIOD:             -1
      NODE_TERMINATION_GRACE_PERIOD:            120
      EMIT_KUBERNETES_EVENTS:                   true
      COMPLETE_LIFECYCLE_ACTION_DELAY_SECONDS:  -1
      ENABLE_SQS_TERMINATION_DRAINING:          true
      QUEUE_URL:                                https://sqs.ap-southeast-5.amazonaws.com/715841329405/monitoring-shell-ronpos-com-nth
      DELETE_SQS_MSG_IF_NODE_NOT_FOUND:         false
      WORKERS:                                  10
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-45qzm (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-45qzm:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node-role.kubernetes.io/control-plane op=Exists
                              node-role.kubernetes.io/master op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=aws-node-termination-handler,app.kubernetes.io/name=aws-node-termination-handler,kops.k8s.io/nth-mode=sqs
                              topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/instance=aws-node-termination-handler,app.kubernetes.io/name=aws-node-termination-handler,kops.k8s.io/nth-mode=sqs
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  4m6s                  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled         3m27s                 default-scheduler  Successfully assigned kube-system/aws-node-termination-handler-7d56b6d497-5qp92 to i-06965db02543d103c
  Normal   Pulling           3m27s                 kubelet            Pulling image "public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5"
  Normal   Pulled            3m9s                  kubelet            Successfully pulled image "public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5" in 16.811s (17.697s including waiting). Image size: 16516861 bytes.
  Normal   Started           2m20s (x4 over 3m9s)  kubelet            Started container aws-node-termination-handler
  Warning  BackOff           108s (x10 over 3m7s)  kubelet            Back-off restarting failed container aws-node-termination-handler in pod aws-node-termination-handler-7d56b6d497-5qp92_kube-system(0441621b-8f9a-45ca-9d22-4209fd83d2b8)
  Normal   Created           96s (x5 over 3m9s)    kubelet            Created container aws-node-termination-handler
  Normal   Pulled            96s (x4 over 3m8s)    kubelet            Container image "public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.22.0@sha256:fc4b883511887a535a48c8918735be97c15f8f67c66e6aca869ec051091df6a5" already present on machine

This is the log output from kubectl logs:

{"level":"info","time":"2024-09-14T20:52:08Z","message":"Using log format version 2"}
{"level":"info","dry_run":false,"node_name":"i-06965db02543d103c","pod_name":"aws-node-termination-handler-7d56b6d497-5qp92","pod_namespace":"kube-system","metadata_url":"http://169.254.169.254","kubernetes_service_host":"100.64.0.1","kubernetes_service_port":"443","delete_local_data":true,"ignore_daemon_sets":true,"pod_termination_grace_period":-1,"node_termination_grace_period":120,"enable_scheduled_event_draining":true,"enable_spot_interruption_draining":true,"enable_sqs_termination_draining":true,"delete_sqs_msg_if_node_not_found":false,"enable_rebalance_monitoring":false,"enable_rebalance_draining":false,"metadata_tries":3,"cordon_only":false,"taint_node":false,"taint_effect":"NoSchedule","exclude_from_load_balancers":true,"json_logging":true,"log_level":"info","webhook_proxy":"","uptime_from_file":"","enable_prometheus_server":false,"prometheus_server_port":9092,"emit_kubernetes_events":true,"kubernetes_events_extra_annotations":"","aws_region":"","aws_endpoint":"","queue_url":"https://sqs.ap-southeast-5.amazonaws.com/715841329405/monitoring-shell-ronpos-com-nth","check_tag_before_draining":true,"ManagedTag":"aws-node-termination-handler/managed","use_provider_id":true,"time":"2024-09-14T20:52:08Z","message":"aws-node-termination-handler arguments"}
{"level":"fatal","time":"2024-09-14T20:52:08Z","message":"Unable to find the AWS region to process queue events."}

Environment

  • NTH App Version: v1.22.0
  • NTH Mode (IMDS/Queue processor): Queue Processor (SQS)
  • OS/Arch: Ubuntu 22.04.4 LTS/Intel
  • Kubernetes version: v1.30.2
  • Installation method: Kops version 1.30.0

Hi @ridzuan5757, thank you for raising this issue. Unfortunately, NTH v1 does not support ap-southeast-5. We have a separate, as-yet-unreleased branch, NTH v2, which you can try; it should work in that region.

Starting a thread here on what the future fix would be: we would need to update aws-sdk-go to v2, as it has the most up-to-date region information. All the region information provided by aws-sdk-go v1 is outdated (hence the NTH failure in ap-southeast-5). We are unsure when we might be able to get to this issue, as our team has limited bandwidth, but we welcome contributions if anybody is interested in starting on a fix.
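
For anyone picking this up: below is a minimal sketch of what region resolution and SQS client construction could look like against aws-sdk-go-v2. It is illustrative only and is not taken from the NTH codebase; the fallback order shown is an assumption. The point is that v2 resolves the region from the environment or shared config, and an explicit IMDS fallback like the one below can fill it in on newer regions, instead of depending on the region tables compiled into aws-sdk-go v1.

package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

func main() {
	ctx := context.Background()

	// Load the default configuration; the region is picked up from the
	// environment or shared config when set there.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("load config: %v", err)
	}

	// If no region was resolved, ask IMDS on the instance, which reports
	// ap-southeast-5 directly rather than consulting a built-in region table.
	if cfg.Region == "" {
		out, err := imds.NewFromConfig(cfg).GetRegion(ctx, &imds.GetRegionInput{})
		if err != nil {
			log.Fatalf("get region from IMDS: %v", err)
		}
		cfg.Region = out.Region
	}

	// Build the SQS client that would poll the termination queue.
	client := sqs.NewFromConfig(cfg)
	_ = client
}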