Spot instance termination not being handled?
evandam opened this issue · 4 comments
Describe the bug
We are running the node termination handler on EKS for spot instance node groups, but we are not seeing any nodes being cordoned or drained, either in the logs or via the webhooks we have configured.
We have a Slack webhook configured and are using the IMDS DaemonSet mode installed with Helm.
Steps to reproduce
helm upgrade -i \
aws-node-termination-handler \
eks/aws-node-termination-handler \
--set webhookURL=https://hooks.slack.com/services/my/webhook
Expected outcome
Slack notifications when spot instances are interrupted and corresponding logs from the daemonset when a node is terminating.
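The daemonset pods are running on the spot nodes (see the logs below). A quick way to confirm that on your own cluster, assuming the chart's default labels (adjust the selector if your release differs):

# List NTH pods and the nodes they run on (label selector assumed from chart defaults)
kubectl get pods -n kube-system -o wide \
  -l app.kubernetes.io/name=aws-node-termination-handler
# Tail logs from the daemonset
kubectl logs -n kube-system daemonset/aws-node-termination-handler -f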
Application Logs
05:51:40.545 dev aws-node-termination-handler 2022/04/19 12:51:40 INF Startup Metadata Retrieved metadata={"accountId":"123456789012","availabilityZone":"us-west-2c","instanceId":"i-0eab4f0b15775428a","instanceLifeCycle":"spot","instanceType":"m6a.xlarge","localHostname":"ip-10-22-43-50.us-west-2.compute.internal","privateIp":"10.22.43.50","publicHostname":"","publicIp":"","region":"us-west-2"}
05:51:40.545 dev aws-node-termination-handler 2022/04/19 12:51:40 INF aws-node-termination-handler arguments:
05:51:40.545 dev aws-node-termination-handler dry-run: false,
05:51:40.545 dev aws-node-termination-handler node-name: ip-10-22-43-50.us-west-2.compute.internal,
05:51:40.545 dev aws-node-termination-handler pod-name: aws-node-termination-handler-ltslg,
05:51:40.545 dev aws-node-termination-handler metadata-url: http://169.254.169.254,
05:51:40.545 dev aws-node-termination-handler kubernetes-service-host: 172.22.0.1,
05:51:40.545 dev aws-node-termination-handler kubernetes-service-port: 443,
05:51:40.545 dev aws-node-termination-handler delete-local-data: true,
05:51:40.545 dev aws-node-termination-handler ignore-daemon-sets: true,
05:51:40.545 dev aws-node-termination-handler pod-termination-grace-period: -1,
05:51:40.545 dev aws-node-termination-handler node-termination-grace-period: 120,
05:51:40.545 dev aws-node-termination-handler enable-scheduled-event-draining: false,
05:51:40.545 dev aws-node-termination-handler enable-spot-interruption-draining: true,
05:51:40.545 dev aws-node-termination-handler enable-sqs-termination-draining: false,
05:51:40.545 dev aws-node-termination-handler enable-rebalance-monitoring: false,
05:51:40.545 dev aws-node-termination-handler enable-rebalance-draining: false,
05:51:40.545 dev aws-node-termination-handler metadata-tries: 3,
05:51:40.545 dev aws-node-termination-handler cordon-only: false,
05:51:40.545 dev aws-node-termination-handler taint-node: false,
05:51:40.545 dev aws-node-termination-handler taint-effect: NoSchedule,
05:51:40.545 dev aws-node-termination-handler exclude-from-load-balancers: false,
05:51:40.545 dev aws-node-termination-handler json-logging: false,
05:51:40.545 dev aws-node-termination-handler log-level: info,
05:51:40.545 dev aws-node-termination-handler webhook-proxy: ,
05:51:40.545 dev aws-node-termination-handler webhook-headers: <not-displayed>,
05:51:40.545 dev aws-node-termination-handler webhook-url: <provided-not-displayed>,
05:51:40.545 dev aws-node-termination-handler webhook-template: <not-displayed>,
05:51:40.545 dev aws-node-termination-handler uptime-from-file: /proc/uptime,
05:51:40.545 dev aws-node-termination-handler enable-prometheus-server: false,
05:51:40.545 dev aws-node-termination-handler prometheus-server-port: 9092,
05:51:40.545 dev aws-node-termination-handler emit-kubernetes-events: false,
05:51:40.545 dev aws-node-termination-handler kubernetes-events-extra-annotations: ,
05:51:40.545 dev aws-node-termination-handler aws-region: us-west-2,
05:51:40.545 dev aws-node-termination-handler queue-url: ,
05:51:40.545 dev aws-node-termination-handler check-asg-tag-before-draining: true,
05:51:40.545 dev aws-node-termination-handler managed-asg-tag: aws-node-termination-handler/managed,
05:51:40.545 dev aws-node-termination-handler assume-asg-tag-propagation: false,
05:51:40.545 dev aws-node-termination-handler aws-endpoint: ,
05:51:40.545 dev aws-node-termination-handler
05:51:40.545 dev aws-node-termination-handler 2022/04/19 12:51:40 INF Started watching for interruption events
05:51:40.545 dev aws-node-termination-handler 2022/04/19 12:51:40 INF Kubernetes AWS Node Termination Handler has started successfully!
05:51:40.545 dev aws-node-termination-handler 2022/04/19 12:51:40 INF Started watching for event cancellations
05:51:40.545 dev aws-node-termination-handler 2022/04/19 12:51:40 INF Started monitoring for events event_type=SPOT_ITN
06:21:40.546 dev aws-node-termination-handler 2022/04/19 13:21:40 INF event store statistics drainable-events=0 size=0
06:51:40.546 dev aws-node-termination-handler 2022/04/19 13:51:40 INF event store statistics drainable-events=0 size=0
07:21:40.546 dev aws-node-termination-handler 2022/04/19 14:21:40 INF event store statistics drainable-events=0 size=0
07:51:40.545 dev aws-node-termination-handler 2022/04/19 14:51:40 INF Garbage-collecting the interruption event store
07:51:40.545 dev aws-node-termination-handler 2022/04/19 14:51:40 INF event store statistics drainable-events=0 size=0
08:21:40.545 dev aws-node-termination-handler 2022/04/19 15:21:40 INF event store statistics drainable-events=0 size=0
08:51:40.546 dev aws-node-termination-handler 2022/04/19 15:51:40 INF event store statistics drainable-events=0 size=0
09:21:40.546 dev aws-node-termination-handler 2022/04/19 16:21:40 INF event store statistics drainable-events=0 size=0
09:51:40.546 dev aws-node-termination-handler 2022/04/19 16:51:40 INF Garbage-collecting the interruption event store
09:51:40.546 dev aws-node-termination-handler 2022/04/19 16:51:40 INF event store statistics drainable-events=0 size=0
10:21:40.546 dev aws-node-termination-handler 2022/04/19 17:21:40 INF event store statistics drainable-events=0 size=0
10:51:40.545 dev aws-node-termination-handler 2022/04/19 17:51:40 INF event store statistics drainable-events=0 size=0
11:21:40.545 dev aws-node-termination-handler 2022/04/19 18:21:40 INF event store statistics drainable-events=0 size=0
11:51:40.545 dev aws-node-termination-handler 2022/04/19 18:51:40 INF Garbage-collecting the interruption event store
11:51:40.546 dev aws-node-termination-handler 2022/04/19 18:51:40 INF event store statistics drainable-events=0 size=0
Environment
- NTH App Version: 0.18.1
- NTH Mode (IMDS/Queue processor): IMDS
- OS/Arch: Amazon Linux 2 / amd64
- Kubernetes version: 1.21
- Installation method: Helm
I'm actually seeing the same issue on my end.
@evandam @scalp42 can you share full reproduction steps for this issue?
I've been trying to reproduce this issue, but have not been successful. To test, I set up a new EKS cluster with spot nodes in a self-managed node group, pointed at a new Slack webhook, and installed NTH in IMDS mode through Helm with the README defaults:
helm upgrade --install aws-node-termination-handler \
--namespace kube-system \
--set enableSpotInterruptionDraining="true" \
--set enableRebalanceMonitoring="false" \
--set enableScheduledEventDraining="false" \
--set webhookURL=https://hooks.slack.com/services/YOUR/SLACK/URL \
eks/aws-node-termination-handler
Below are logs from an interrupted node. I do see the Spot Instance interruptions being picked up by NTH, and the webhook notification being sent.
~ kubectl logs pods/aws-node-termination-handler-2hrkl -n kube-system -f
2022/04/26 02:28:02 INF Startup Metadata Retrieved metadata={"accountId":"12345678","availabilityZone":"us-east-2a","instanceId":"i-0ccd8fba93d77b39c","instanceLifeCycle":"spot","instanceType":"t3.medium","localHostname":"ip-x-x-x-x.us-east-2.compute.internal","privateIp":"x.x.x.x","publicHostname":"ec2-x-x-x-x.us-east-2.compute.amazonaws.com","publicIp":"x.x.x.x","region":"us-east-2"}
2022/04/26 02:28:02 INF aws-node-termination-handler arguments:
dry-run: false,
node-name: ip-x-x-x-x.us-east-2.compute.internal,
pod-name: aws-node-termination-handler-2hrkl,
metadata-url: http://x-x-x-x,
kubernetes-service-host: x-x-x-x,
kubernetes-service-port: 443,
delete-local-data: true,
ignore-daemon-sets: true,
pod-termination-grace-period: -1,
node-termination-grace-period: 120,
enable-scheduled-event-draining: false,
enable-spot-interruption-draining: true,
enable-sqs-termination-draining: false,
enable-rebalance-monitoring: false,
enable-rebalance-draining: false,
metadata-tries: 3,
cordon-only: false,
taint-node: false,
taint-effect: NoSchedule,
exclude-from-load-balancers: false,
json-logging: false,
log-level: info,
webhook-proxy: ,
webhook-headers: <not-displayed>,
webhook-url: <provided-not-displayed>,
webhook-template: <not-displayed>,
uptime-from-file: /proc/uptime,
enable-prometheus-server: false,
prometheus-server-port: 9092,
emit-kubernetes-events: false,
kubernetes-events-extra-annotations: ,
aws-region: us-east-2,
queue-url: ,
check-asg-tag-before-draining: true,
managed-asg-tag: aws-node-termination-handler/managed,
assume-asg-tag-propagation: false,
use-provider-id: false,
aws-endpoint: ,
2022/04/26 02:28:02 INF Started watching for interruption events
2022/04/26 02:28:02 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/04/26 02:28:02 INF Started watching for event cancellations
2022/04/26 02:28:02 INF Started monitoring for events event_type=SPOT_ITN
2022/04/26 02:45:54 INF Adding new event to the event store event={"AutoScalingGroupName":"","Description":"Spot ITN received. Instance will be interrupted at 2022-04-26T02:47:52Z \n","EndTime":"0001-01-01T00:00:00Z","EventID":"spot-itn-4eabdd13ca3a0eab64dasa7700cd7788fa60ed0107d400077ce4b08dea526b","InProgress":false,"InstanceID":"","IsManaged":false,"Kind":"SPOT_ITN","NodeLabels":null,"NodeName":"ip-x-x-x-x.us-east-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"","StartTime":"2022-04-26T02:47:52Z","State":""}
2022/04/26 02:45:55 INF Requesting instance drain event-id=spot-itn-4eabdd13ca3a0ea4dda7700cd7788fa60ed0107d400077ce4b08dea526bb8f instance-id= kind=SPOT_ITN node-name=ip-x-x-x-x.us-east-2.compute.internal provider-id=
2022/04/26 02:45:55 INF Pods on node node_name=ip-x-x-x-x.us-east-2.compute.internal pod_names=["aws-node-lflw5","aws-node-termination-handler-2hrkl","kube-proxy-fnc5h"]
2022/04/26 02:45:55 INF Draining the node
2022/04/26 02:45:55 ??? WARNING: ignoring DaemonSet-managed Pods: kube-system/aws-node-lflw5, kube-system/aws-node-termination-handler-2hrkl, kube-system/kube-proxy-fnc5h
2022/04/26 02:45:55 INF Node successfully cordoned and drained node_name=ip-x-x-x-x.us-east-2.compute.internal reason="Spot ITN received. Instance will be interrupted at 2022-04-26T02:47:52Z \n"
2022/04/26 02:45:55 INF Webhook Success: Notification Sent!
Can you help confirm a few things with your setup?
- For the Auto Scaling Group, under Tags, there should be a key | value pair of kubernetes.io/cluster/<cluster-name> | owned to indicate a self-managed node group.
- In the Instance view of the AWS EC2 console, under Details, the Lifecycle value should be spot.
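If it helps, both of those can also be checked from the AWS CLI (the instance ID and ASG name below are placeholders for your own):

# Should return "spot" for a Spot Instance (the field is absent for on-demand)
aws ec2 describe-instances --instance-ids <instance-id> \
  --query 'Reservations[].Instances[].InstanceLifecycle'
# List the tags on the node group's Auto Scaling Group
aws autoscaling describe-tags \
  --filters Name=auto-scaling-group,Values=<asg-name>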
Hey @AustinSiu, I did confirm everything you recommended, but I think I've identified what's going on. Due to some unrelated issues, our spot instances are actually mostly being terminated by AZ rebalancing events rather than spot interruptions.
I redeployed NTH with rebalance monitoring enabled and am seeing webhooks coming through.
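For anyone else who lands here, this is roughly the override we redeployed with. enableRebalanceMonitoring matches the chart value used in the command above; enableRebalanceDraining is assumed to map to the enable-rebalance-draining argument shown in the logs, for when you want the node drained rather than only cordoned:

helm upgrade -i \
  aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --set enableRebalanceMonitoring=true \
  --set webhookURL=https://hooks.slack.com/services/my/webhook
# optionally add --set enableRebalanceDraining=true to drain (not just cordon) on rebalance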
I'll reopen if needed, but I think that did the trick. Thanks!