AWS Node Termination Handler
Gracefully handle EC2 instance shutdown within Kubernetes
Project Summary
This project ensures that the Kubernetes control plane responds appropriately to events that can cause your EC2 instance to become unavailable, such as EC2 maintenance events, EC2 Spot interruptions, ASG Scale-In, ASG AZ Rebalance, and EC2 Instance Termination via the API or Console. If these events are not handled, your application code may not stop gracefully, may take longer to recover full availability, or may accidentally schedule work to nodes that are going down.
The aws-node-termination-handler (NTH) can operate in two different modes: Instance Metadata Service (IMDS) or the Queue Processor.
The aws-node-termination-handler Instance Metadata Service Monitor runs a small pod on each host that monitors IMDS paths like /spot or /events and reacts by cordoning and/or draining the corresponding node.
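As a rough illustration of what this monitoring involves (not NTH's actual implementation), the sketch below reads the Spot interruption path over IMDSv2 with curl and extracts the pending action from the returned JSON. The function names imds_get and spot_action are hypothetical, and the sample document is an assumed example in the format EC2 uses for spot/instance-action.

```shell
#!/usr/bin/env bash
# Illustrative only: NTH does this internally; function names are hypothetical.
IMDS="http://169.254.169.254"

# Fetch an IMDS path using an IMDSv2 session token (only works on an EC2 instance).
imds_get() {
  local token
  token=$(curl -s -X PUT "$IMDS/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  curl -s -f -H "X-aws-ec2-metadata-token: $token" "$IMDS/latest/meta-data/$1"
}

# Extract the "action" field from an instance-action document read on stdin.
spot_action() {
  sed -n 's/.*"action"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# On an instance: imds_get spot/instance-action | spot_action
# The path returns HTTP 404 until an interruption is scheduled.
sample='{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
action=$(printf '%s\n' "$sample" | spot_action)
echo "pending action: ${action:-none}"
```

Parsing with sed keeps the sketch free of extra dependencies; a JSON-aware tool such as jq would be more robust for real use.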
The aws-node-termination-handler Queue Processor will monitor an SQS queue of events from Amazon EventBridge for ASG lifecycle events, EC2 status change events, and Spot Interruption Termination Notice events. When NTH detects that an instance is going down, it uses the Kubernetes API to cordon the node so that no new work is scheduled there, then drains it to remove any existing work. The termination handler Queue Processor requires AWS IAM permissions to monitor and manage the SQS queue and to query the EC2 API. The queue processor mode is currently in a beta preview, but we'd love your feedback on it!
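For readers who want to see the cordon-then-drain order in familiar terms, here is a minimal sketch using the kubectl equivalents. NTH itself talks to the Kubernetes API directly; the helper name cordon_and_drain, the node name, and the grace period are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Dry run by default: commands are printed, not executed.
# Set KUBECTL=kubectl to run them against the current cluster context.
KUBECTL="${KUBECTL:-echo kubectl}"

# Hypothetical helper mirroring the cordon-then-drain order described above.
cordon_and_drain() {
  local node="$1"
  # Cordon first so the scheduler stops placing new pods on the node.
  $KUBECTL cordon "$node" || return 1
  # Then evict the pods already running there (DaemonSet pods are skipped).
  # Older kubectl releases use --delete-local-data instead of --delete-emptydir-data.
  $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=120
}

plan=$(cordon_and_drain ip-10-0-0-12.ec2.internal)
printf '%s\n' "$plan"
```

Cordoning before draining matters: if the node were drained first, evicted pods could be rescheduled onto the same node that is about to disappear.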
You can run the termination handler on any Kubernetes cluster running on AWS, including self-managed clusters and those created with Amazon Elastic Kubernetes Service.
Major Features
Instance Metadata Service Processor
- Monitors EC2 Metadata for Scheduled Maintenance Events
- Monitors EC2 Metadata for Spot Instance Termination Notifications
- Monitors EC2 Metadata for Rebalance Recommendation Notifications
- Helm installation and event configuration support
- Webhook feature to send shutdown or restart notification messages
- Unit & Integration Tests
Queue Processor
- Monitors an SQS Queue for:
- EC2 Spot Interruption Notifications
- EC2 Instance Rebalance Recommendation
- EC2 Auto-Scaling Group Termination Lifecycle Hooks to take care of ASG Scale-In, AZ-Rebalance, Unhealthy Instances, and more!
- EC2 Status Change Events
- Helm installation and event configuration support
- Webhook feature to send shutdown or restart notification messages
- Unit & Integration Tests
Which one should I use?
Feature | IMDS Processor | Queue Processor |
---|---|---|
K8s DaemonSet | ✅ | ❌ |
K8s Deployment | ❌ | ✅ |
Spot Instance Interruptions (ITN) | ✅ | ✅ |
Scheduled Events | ✅ | ✅ |
EC2 Instance Rebalance Recommendation | ✅ | ✅ |
ASG Lifecycle Hooks | ❌ | ✅ |
EC2 Status Changes | ❌ | ✅ |
Setup Required | ❌ | ✅ |
Installation and Configuration
AWS Node Termination Handler - IMDS Processor
Installation and Configuration
Installing the termination handler adds four Kubernetes resources to your cluster: a ServiceAccount, a ClusterRole, a ClusterRoleBinding, and a DaemonSet. All four are required for the termination handler to run properly.
Kubectl Apply
You can use kubectl to directly add all of the above resources with the default configuration into your cluster.
kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.9.0/all-resources.yaml
For a full list of releases and associated artifacts see our releases page.
Helm
The easiest way to configure the various options of the termination handler is via helm. The chart for this project is hosted in the eks-charts repository.
To get started, add the eks-charts repo to helm:
helm repo add eks https://aws.github.io/eks-charts
Once that is complete you can install the termination handler. We've provided some sample setup options below.
Zero Config:
helm upgrade --install aws-node-termination-handler \
--namespace kube-system \
eks/aws-node-termination-handler
Enabling Features:
helm upgrade --install aws-node-termination-handler \
--namespace kube-system \
--set enableSpotInterruptionDraining="true" \
--set enableRebalanceMonitoring="true" \
--set enableScheduledEventDraining="false" \
eks/aws-node-termination-handler
Running Only On Specific Nodes:
helm upgrade --install aws-node-termination-handler \
--namespace kube-system \
--set nodeSelector.lifecycle=spot \
eks/aws-node-termination-handler
Webhook Configuration:
helm upgrade --install aws-node-termination-handler \
--namespace kube-system \
--set webhookURL=https://hooks.slack.com/services/YOUR/SLACK/URL \
eks/aws-node-termination-handler
Alternatively, pass Webhook URL as a Secret:
WEBHOOKURL_LITERAL="webhookurl=https://hooks.slack.com/services/YOUR/SLACK/URL"
kubectl create secret -n kube-system generic webhooksecret --from-literal=$WEBHOOKURL_LITERAL
helm upgrade --install aws-node-termination-handler \
--namespace kube-system \
--set webhookURLSecretName=webhooksecret \
eks/aws-node-termination-handler
For a full list of configuration options see our Helm readme.
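The --set flags from the examples above can also be collected in a values file, which is easier to review and keep in version control. The sketch below reuses only the option names shown in this section; the file path is an arbitrary choice, and the helm command is printed rather than executed unless HELM=helm is set.

```shell
#!/usr/bin/env bash
# Write the settings from the examples above to a values file.
# The file path is an arbitrary choice for this sketch.
VALUES_FILE="${VALUES_FILE:-/tmp/nth-values.yaml}"
cat > "$VALUES_FILE" <<'EOF'
enableSpotInterruptionDraining: true
enableRebalanceMonitoring: true
enableScheduledEventDraining: false
nodeSelector:
  lifecycle: spot
EOF

# Dry run by default; set HELM=helm to perform the install.
HELM="${HELM:-echo helm}"
$HELM upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --values "$VALUES_FILE" \
  eks/aws-node-termination-handler
```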
AWS Node Termination Handler - Queue Processor (requires AWS IAM Permissions)
NOTE: THIS FUNCTIONALITY IS CURRENTLY IN BETA
Infrastructure Setup
The termination handler deployment requires some infrastructure to be set up before deploying the application. You'll need the following AWS infrastructure components:
- AutoScaling Group Termination Lifecycle Hook
- Amazon Simple Queue Service (SQS) Queue
- Amazon EventBridge Rule
- IAM Role for the aws-node-termination-handler Queue Processing Pods
1. Set up a Termination Lifecycle Hook on an ASG:
Here is the AWS CLI command to create a termination lifecycle hook on an existing ASG, although this should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:
$ aws autoscaling put-lifecycle-hook \
--lifecycle-hook-name=my-k8s-term-hook \
--auto-scaling-group-name=my-k8s-asg \
--lifecycle-transition=autoscaling:EC2_INSTANCE_TERMINATING \
--default-result=CONTINUE \
--heartbeat-timeout=300
2. Tag the ASGs:
By default, the aws-node-termination-handler will only manage terminations for ASGs tagged with the key aws-node-termination-handler/managed:
$ aws autoscaling create-or-update-tags \
--tags ResourceId=my-auto-scaling-group,ResourceType=auto-scaling-group,Key=aws-node-termination-handler/managed,Value=,PropagateAtLaunch=true
The value of the tag does not matter.
This functionality is helpful in accounts where there are ASGs that do not run kubernetes nodes or you do not want aws-node-termination-handler to manage their termination lifecycle.
However, if your account is dedicated to ASGs for your kubernetes cluster, then you can turn off the ASG tag check by setting the flag --check-asg-tag-before-draining=false or environment variable CHECK_ASG_TAG_BEFORE_DRAINING=false.
You can also control what resources NTH manages by adding the resource ARNs to your Amazon EventBridge rules.
Take a look at the docs on how to create rules that only manage certain ASGs here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/cloud-watch-events.html
See all the different events docs here: https://docs.aws.amazon.com/eventbridge/latest/userguide/event-types.html#auto-scaling-event-types
3. Create an SQS Queue:
Here is the AWS CLI command to create an SQS queue to hold termination events from ASG and EC2, although this should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:
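The commands in this step reference the shell variables AWS_REGION, ACCOUNT_ID, and SQS_QUEUE_NAME, which need to be set beforehand. A minimal sketch, assuming placeholder values that you would replace with your own (the QUEUE_ARN variable is introduced here only for convenience):

```shell
#!/usr/bin/env bash
# Placeholder values; replace them with your own region, account, and queue name.
AWS_REGION="${AWS_REGION:-us-east-1}"
SQS_QUEUE_NAME="${SQS_QUEUE_NAME:-MyK8sTermQueue}"
# With credentials configured, the account ID can be looked up instead:
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"

# AWS account IDs are 12 digits; catch typos before they end up in the policy.
case "$ACCOUNT_ID" in
  *[!0-9]*|"") echo "ACCOUNT_ID must be numeric" >&2; exit 1 ;;
esac
[ "${#ACCOUNT_ID}" -eq 12 ] || { echo "ACCOUNT_ID must be 12 digits" >&2; exit 1; }

export AWS_REGION ACCOUNT_ID SQS_QUEUE_NAME
QUEUE_ARN="arn:aws:sqs:${AWS_REGION}:${ACCOUNT_ID}:${SQS_QUEUE_NAME}"
echo "$QUEUE_ARN"
```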
## Queue Policy
$ QUEUE_POLICY=$(cat <<EOF
{
"Version": "2012-10-17",
"Id": "MyQueuePolicy",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Service": ["events.amazonaws.com", "sqs.amazonaws.com"]
},
"Action": "sqs:SendMessage",
"Resource": [
"arn:aws:sqs:${AWS_REGION}:${ACCOUNT_ID}:${SQS_QUEUE_NAME}"
]
}]
}
EOF
)
## make sure the queue policy is valid JSON
$ echo "$QUEUE_POLICY" | jq .
## Save queue attributes to a temp file
$ cat << EOF > /tmp/queue-attributes.json
{
"MessageRetentionPeriod": "300",
"Policy": "$(echo $QUEUE_POLICY | sed 's/\"/\\"/g')"
}
EOF
$ aws sqs create-queue --queue-name "${SQS_QUEUE_NAME}" --attributes file:///tmp/queue-attributes.json
4. Create an Amazon EventBridge Rule
Here is the AWS CLI command to create an Amazon EventBridge rule so that ASG termination events are sent to the SQS queue created in the previous step. This should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:
$ aws events put-rule \
--name MyK8sASGTermRule \
--event-pattern "{\"source\":[\"aws.autoscaling\"],\"detail-type\":[\"EC2 Instance-terminate Lifecycle Action\"]}"
$ aws events put-targets --rule MyK8sASGTermRule \
--targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"
$ aws events put-rule \
--name MyK8sSpotTermRule \
--event-pattern "{\"source\": [\"aws.ec2\"],\"detail-type\": [\"EC2 Spot Instance Interruption Warning\"]}"
$ aws events put-targets --rule MyK8sSpotTermRule \
--targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"
$ aws events put-rule \
--name MyK8sRebalanceRule \
--event-pattern "{\"source\": [\"aws.ec2\"],\"detail-type\": [\"EC2 Instance Rebalance Recommendation\"]}"
$ aws events put-targets --rule MyK8sRebalanceRule \
--targets "Id"="1","Arn"="arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue"
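The rules above cover ASG lifecycle, Spot interruption, and rebalance events. If you also want the EC2 status change events mentioned in the feature list to reach the queue, a rule along the same lines might look like the sketch below. The rule name MyK8sInstanceStateChangeRule and the event_pattern helper are hypothetical, the detail-type is the one EventBridge uses for EC2 state changes, and the aws commands are printed rather than executed unless AWS=aws is set.

```shell
#!/usr/bin/env bash
# Build an EventBridge event pattern without hand-escaping quotes.
event_pattern() {
  printf '{"source":["%s"],"detail-type":["%s"]}' "$1" "$2"
}

# Dry run by default; set AWS=aws to create the rule and target.
AWS="${AWS:-echo aws}"
QUEUE_ARN="${QUEUE_ARN:-arn:aws:sqs:us-east-1:123456789012:MyK8sTermQueue}"
PATTERN=$(event_pattern aws.ec2 "EC2 Instance State-change Notification")

$AWS events put-rule \
  --name MyK8sInstanceStateChangeRule \
  --event-pattern "$PATTERN"
$AWS events put-targets --rule MyK8sInstanceStateChangeRule \
  --targets "Id=1,Arn=$QUEUE_ARN"
```

Generating the pattern with printf avoids the backslash-escaped quotes used in the commands above, which are easy to get wrong when editing by hand.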
5. Create an IAM Role for the Pods
There are many different ways to allow the aws-node-termination-handler pods to assume a role, for example IAM Roles for Service Accounts on Amazon EKS, EC2 instance profiles, or Kiam.
IAM Policy for aws-node-termination-handler Deployment:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:CompleteLifecycleAction",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeTags",
"ec2:DescribeInstances",
"sqs:DeleteMessage",
"sqs:ReceiveMessage"
],
"Resource": "*"
}
]
}
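One way to turn the policy document above into a managed IAM policy is sketched below. The policy name and file path are assumptions for this example, and the aws command is printed rather than executed unless AWS=aws is set. Attaching the policy to a role depends on which pod identity mechanism you use, so that step is left out.

```shell
#!/usr/bin/env bash
# Save the policy shown above to a file (path chosen for this sketch).
POLICY_FILE="${POLICY_FILE:-/tmp/nth-policy.json}"
cat > "$POLICY_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:CompleteLifecycleAction",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstances",
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage"
      ],
      "Resource": "*"
    }
  ]
}
EOF

# Dry run by default; set AWS=aws to create the policy.
# The policy name is a hypothetical example.
AWS="${AWS:-echo aws}"
$AWS iam create-policy \
  --policy-name aws-node-termination-handler-queue-processor \
  --policy-document "file://$POLICY_FILE"
```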
Installation
Kubectl Apply
You can use kubectl to directly add the termination handler resources with the default configuration into your cluster.
kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.9.0/all-resources-queue-processor.yaml
For a full list of releases and associated artifacts see our releases page.
Helm
The easiest way to configure the various options of the termination handler is via helm. The chart for this project is hosted in the eks-charts repository.
To get started, add the eks-charts repo to helm:
helm repo add eks https://aws.github.io/eks-charts
Once that is complete you can install the termination handler. We've provided some sample setup options below.
Minimal Config:
helm upgrade --install aws-node-termination-handler \
--namespace kube-system \
--set enableSqsTerminationDraining=true \
--set queueURL=https://sqs.us-east-1.amazonaws.com/0123456789/my-term-queue \
eks/aws-node-termination-handler
Webhook Configuration:
helm upgrade --install aws-node-termination-handler \
--namespace kube-system \
--set enableSqsTerminationDraining=true \
--set queueURL=https://sqs.us-east-1.amazonaws.com/0123456789/my-term-queue \
--set webhookURL=https://hooks.slack.com/services/YOUR/SLACK/URL \
eks/aws-node-termination-handler
Alternatively, pass Webhook URL as a Secret:
WEBHOOKURL_LITERAL="webhookurl=https://hooks.slack.com/services/YOUR/SLACK/URL"
kubectl create secret -n kube-system generic webhooksecret --from-literal=$WEBHOOKURL_LITERAL
helm upgrade --install aws-node-termination-handler \
--namespace kube-system \
--set enableSqsTerminationDraining=true \
--set queueURL=https://sqs.us-east-1.amazonaws.com/0123456789/my-term-queue \
--set webhookURLSecretName=webhooksecret \
eks/aws-node-termination-handler
For a full list of configuration options see our Helm readme.
Use with Kiam
Using the termination handler alongside Kiam requires some extra configuration on Kiam's end. By default Kiam will block all access to the metadata address, so you need to make sure it passes through the requests the termination handler relies on.
To add a whitelist configuration, use the following fields in the Kiam Helm chart values:
agent.whiteListRouteRegexp: '^\/latest\/meta-data\/(spot\/instance-action|events\/maintenance\/scheduled|instance-(id|type)|public-(hostname|ipv4)|local-(hostname|ipv4)|placement\/availability-zone)|\/latest\/dynamic\/instance-identity\/document$'
Or just pass it as an argument to the kiam agents:
kiam agent --whitelist-route-regexp='^\/latest\/meta-data\/(spot\/instance-action|events\/maintenance\/scheduled|instance-(id|type)|public-(hostname|ipv4)|local-(hostname|ipv4)|placement\/availability-zone)|\/latest\/dynamic\/instance-identity\/document$'
Metadata endpoints
The termination handler relies on the following metadata endpoints to function properly:
/latest/dynamic/instance-identity/document
/latest/meta-data/spot/instance-action
/latest/meta-data/events/recommendations/rebalance
/latest/meta-data/events/maintenance/scheduled
/latest/meta-data/instance-id
/latest/meta-data/instance-type
/latest/meta-data/public-hostname
/latest/meta-data/public-ipv4
/latest/meta-data/local-hostname
/latest/meta-data/local-ipv4
/latest/meta-data/placement/availability-zone
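When running behind Kiam, it can be useful to check which of these endpoints the route pattern actually lets through. The sketch below tests the list above against the Kiam pattern from this document using grep -E, on the assumption that extended regular expressions behave like Kiam's own matching for a pattern this simple. The helper name unmatched_endpoints is hypothetical.

```shell
#!/usr/bin/env bash
# The Kiam pattern from the section above; "\/" escapes are removed for grep -E.
KIAM_REGEXP='^\/latest\/meta-data\/(spot\/instance-action|events\/maintenance\/scheduled|instance-(id|type)|public-(hostname|ipv4)|local-(hostname|ipv4)|placement\/availability-zone)|\/latest\/dynamic\/instance-identity\/document$'
ERE=$(printf '%s\n' "$KIAM_REGEXP" | sed 's#\\/#/#g')

# Print every endpoint (one per line on stdin) that the pattern does not match.
unmatched_endpoints() {
  grep -Ev -- "$ERE" || true
}

blocked=$(unmatched_endpoints <<'EOF'
/latest/dynamic/instance-identity/document
/latest/meta-data/spot/instance-action
/latest/meta-data/events/recommendations/rebalance
/latest/meta-data/events/maintenance/scheduled
/latest/meta-data/instance-id
/latest/meta-data/instance-type
/latest/meta-data/public-hostname
/latest/meta-data/public-ipv4
/latest/meta-data/local-hostname
/latest/meta-data/local-ipv4
/latest/meta-data/placement/availability-zone
EOF
)
printf 'not covered: %s\n' "${blocked:-none}"
```

With the pattern and endpoint list exactly as written in this document, the check reports the rebalance recommendation path as not covered, so you may need to extend the pattern if you rely on rebalance monitoring.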
Building
For build instructions please consult BUILD.md.
Communication
- If you've run into a bug or have a new feature request, please open an issue.
- You can also chat with us in the Kubernetes Slack in the #provider-aws channel.
- Check out the open source Amazon EC2 Spot Instances Integrations Roadmap to see what we're working on and give us feedback!
Contributing
Contributions are welcome! Please read our guidelines and our Code of Conduct.
License
This project is licensed under the Apache-2.0 License.