aws/aws-node-termination-handler

TEST_NOTIFICATION event sent to newly added ASG lifecycle hook w/o Event Bridge triggers NTH panic

ngdah opened this issue · 6 comments

ngdah commented

Describe the bug
NTH panics when trying to read from an SQS associated with a newly added ASG lifecycle hook w/o event bridge.

According to the ASG lifecycle hook documentation:

When you first add a lifecycle hook, a test notification message is published to the notification target

This event will be a autoscaling:TEST_NOTIFICATION event, which is not supported by NTH. If there are no ASG terminate event sent, NTH will panic and restart.

Steps to reproduce

  1. Create a new ASG for testing;
  2. Create a new ASG lifecycle hook in the new ASG, providing role ARN and SQS ARN to send message to the queue directly;
  3. Setup NTH to read form the SQS associated with the lifecycle hook.

Expected outcome
NTH should not panic because of TEST_NOTIFICATION event

Application Logs

2022/04/12 00:44:49 INF only lifecycle termination events from ASG to SQS are supported outside EventBridge
2022/04/12 00:44:49 ERR error processing SQS message error="unsupported message type ({\n  Attributes: {\n    SentTimestamp: \"1649720528564\"\n  },\n  Body: \"{\\\"AccountId\\\":\\\"<redacted>\\\",\\\"RequestId\\\":\\\"c3a26f07-6ad0-4fe5-b242-abca74ab7981\\\",\\\"AutoScalingGroupARN\\\":\\\"arn:aws:autoscaling:us-east-1:<redacted>:autoScalingGroup:ffab3e8b-e4e8-45e4-b077-0d13dfe97819:autoScalingGroupName/<redacted>\\\",\\\"AutoScalingGroupName\\\":\\\"<redacted>\\\",\\\"Service\\\":\\\"AWS Auto Scaling\\\",\\\"Event\\\":\\\"autoscaling:TEST_NOTIFICATION\\\",\\\"Time\\\":\\\"2022-04-11T23:42:08.494Z\\\"}\",\n  MD5OfBody: \"6a4f8c60015ee80fcd60c7e215796fdc\",\n  MessageId: \"edd69ff9-4b42-4f0f-9def-5820541589aa\",\n  ReceiptHandle: \"AQEB12e0rtmcUgV9mROTlrqShR95Y9i5dZoHWwWwJWM9XYx0KUeaK3i5khxVFszT19oz7zNoL9B+4dPmjcWC3GHc8940FWCGg1dZhHYOAMC9OaBpZLRthvmrVdLGItgvimwQPZ8Qqi0IEspXnf8jls5BTryACE0bH7UHnXeiKHLk1PR64Ka/1VeEVawPoqVFP6oYENF6IYkVKLNrjoIeyZHshHv5o/UcNpf8uZHQvbcsplBqaUopvbP75qNvXTngVEQK5A1cylT8SDPLno887XiMi5npYJtm3MmW2d7n9iDxtw94TDkK0fpXsrDeEyb41l8iW8C1OA625GJEaZQOh5uRBCogg0MwgCWaX4LV2XU8uTMDrLiCoWrs78FaW1rz8nPAkuP2mTaWPMCCWFWHNbhxD6ddV9fyZ6PmNquJlpD/MRI=\"\n})"
2022/04/12 00:44:49 WRN There was a problem monitoring for events error="none of the waiting queue events could be processed" event_type=SQS_TERMINATE
2022/04/12 00:44:49 WRN Stopping NTH - Duplicate Error Threshold hit.
panic: none of the waiting queue events could be processed

goroutine 85 [running]:
main.main.func3(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/node-termination-handler/cmd/node-termination-handler.go:211 +0x649
created by main.main
	/node-termination-handler/cmd/node-termination-handler.go:193 +0xd51

Environment

  • NTH App Version: 1.16.1
  • NTH Mode (IMDS/Queue processor): Queue processor
  • OS/Arch: linus/amd64
  • Kubernetes version: 1.20
  • Installation method: kubectl apply

Released as part of v1.16.2.

Observed this with ver 1.16.3

2022/05/19 12:35:17 INF only lifecycle termination events from ASG to SQS are supported outside EventBridge
2022/05/19 12:35:17 ERR error processing SQS message error="unsupported message type ({\n  Attributes: {\n    SentTimestamp: \"1652963523984\"\n  },\n  Body: \"{\\\"AccountId\\\":\\\"<REDACTED>\\\",\\\"RequestId\\\":\\\"de072de4-c55e-4378-a223-7def3c58dc0d\\\",\\\"AutoScalingGroupARN\\\":\\\"arn:aws:autoscaling:us-east-1:<REDACTED>:autoScalingGroup:a0b3503a-0ca4-443b-a64c-d01835aa11c9:autoScalingGroupName/eks-72bce5fc-7e1e-731a-0f6a-e458726aa02f\\\",\\\"AutoScalingGroupName\\\":\\\"eks-72bce5fc-7e1e-731a-0f6a-e458726aa02f\\\",\\\"Service\\\":\\\"AWS Auto Scaling\\\",\\\"Event\\\":\\\"autoscaling:TEST_NOTIFICATION\\\",\\\"Time\\\":\\\"2022-05-19T12:32:03.912Z\\\"}\",\n  MD5OfBody: \"442ab440e6eb5f43b0c41b67b6229656\",\n  MessageId: \"87163343-f682-4e04-a562-3cec206e8730\",\n  ReceiptHandle: \"AQEB3nTlC3Hkx2vcyd42WAWR8X31SP9N0xs0DklguF2U6VY05l09WMhl8rP4KA6yncFjmRaKTuL1+ryXzMs2ZoC7LoaiOUWn9JR/mvMZ7CYKyjHicEIMVVqjw+ys8TVvP+Fwr93B1KzP3NsmLx1B0cEp7ZhjLE9NDAO52/6rvEIsLu8wdA10klgNLQOwS9JKdx4N0pHt75HOGclJ41hL3YEuBwpvxOqbOanaMyhamrZl8/UoV84KWiFXLm1vp806Sl/+y+CB5nqXniZIUVzpK+JZ3+qs63gfJ3ypVdFdqLqEoaR26DDzT43+b9qDYNb3Gt7brHN6Wp0e+bqV6i4f+zeBGqFuFcMg8S5mk8HfaOXuPaT0k/OYFEel7j7kWWnBFXK9NNXRIh2Nddt+jXVLXVkjIt5dvIl9RPWnI5zLF1s+BdQ=\"\n})"
2022/05/19 12:35:17 WRN There was a problem monitoring for events error="none of the waiting queue events could be processed" event_type=SQS_TERMINATE
2022/05/19 12:35:19 INF only lifecycle termination events from ASG to SQS are supported outside EventBridge
2022/05/19 12:35:19 ERR error processing SQS message error="unsupported message type ({\n  Attributes: {\n    SentTimestamp: \"1652963523993\"\n  },\n  Body: \"{\\\"AccountId\\\":\\\"<REDACTED>\\\",\\\"RequestId\\\":\\\"e39c6ef2-c439-4103-8fb5-16a14e7b22a9\\\",\\\"AutoScalingGroupARN\\\":\\\"arn:aws:autoscaling:us-east-1:<REDACTED>:autoScalingGroup:e0cbd456-bc5e-4d76-a5c4-a247233d1547:autoScalingGroupName/eks-ecbce5fc-7e21-ef70-6ae4-0a1580d5d516\\\",\\\"AutoScalingGroupName\\\":\\\"eks-ecbce5fc-7e21-ef70-6ae4-0a1580d5d516\\\",\\\"Service\\\":\\\"AWS Auto Scaling\\\",\\\"Event\\\":\\\"autoscaling:TEST_NOTIFICATION\\\",\\\"Time\\\":\\\"2022-05-19T12:32:03.932Z\\\"}\",\n  MD5OfBody: \"c1ff1e6c85114a8a7fffdbfe1db9041e\",\n  MessageId: \"74f9991c-d280-4e85-847c-35ce988f9e7d\",\n  ReceiptHandle: \"AQEBwFIYqV4DhnYg20gENtdlqnwg6QZRo63XIncfvAoU+paCl67CcVhx9sGlGFED9YHR5a6bad4sHXgrxRTEZ47ZHydm7WTAkyc9MvDmURZ896GDomKkJ8TuGAn+h7W3aOuMBIxQsWtbZsEtOJpCQcUwdtXvbrAizeopC0BJQjvGfrfaTZoiW0qhOxhVleC0vI/fDuurS7frZV+TG1QlJApw3TxFl48jdGziR31B9DMqATrkGB3cS5X++Jw7nXC/KT2iPQJQ0JP67+/rqGsodIon0w8kcDP5muxu84QgMke6amvTSgyW+Pp+lroMDkyqZMr4d17YfWf0gvLch25GMhB9co64CMgeSsGVYX/zxv8XR/e3cNTLyOJDdXPr2G1XIW+TmCMSUExcBkz0dxZJcDpojhx5YSBWaBuIqfNywbP9AzU=\"\n})"
2022/05/19 12:35:19 WRN There was a problem monitoring for events error="none of the waiting queue events could be processed" event_type=SQS_TERMINATE
2022/05/19 12:35:37 INF only lifecycle termination events from ASG to SQS are supported outside EventBridge
2022/05/19 12:35:37 ERR error processing SQS message error="unsupported message type ({\n  Attributes: {\n    SentTimestamp: \"1652963523984\"\n  },\n  Body: \"{\\\"AccountId\\\":\\\"<REDACTED>\\\",\\\"RequestId\\\":\\\"de072de4-c55e-4378-a223-7def3c58dc0d\\\",\\\"AutoScalingGroupARN\\\":\\\"arn:aws:autoscaling:us-east-1:<REDACTED>:autoScalingGroup:a0b3503a-0ca4-443b-a64c-d01835aa11c9:autoScalingGroupName/eks-72bce5fc-7e1e-731a-0f6a-e458726aa02f\\\",\\\"AutoScalingGroupName\\\":\\\"eks-72bce5fc-7e1e-731a-0f6a-e458726aa02f\\\",\\\"Service\\\":\\\"AWS Auto Scaling\\\",\\\"Event\\\":\\\"autoscaling:TEST_NOTIFICATION\\\",\\\"Time\\\":\\\"2022-05-19T12:32:03.912Z\\\"}\",\n  MD5OfBody: \"442ab440e6eb5f43b0c41b67b6229656\",\n  MessageId: \"87163343-f682-4e04-a562-3cec206e8730\",\n  ReceiptHandle: \"AQEBNP8IAxcHrLtvD9vXrc1Z7hf2+K6jI7gDBToVG+Y7Ualfp4iYwJjlQ+nUNVLYLKFvV9VJiqp9dWC0WK5SyjzCzSlUDeMnqL994mMy4wlFXh7RYGUmclPUQRtq0FgFQPHIOyOTipJL/q6U/qZumv6lb7TnBNp1u+ZuxNKJy3sDA7Un20IZ9m2zx+19hqcz9k34w9myWBnuB+hPIhcz4K8Qgl0f/PZQNUS+P0sB/7nSs4iLebA4t2bQlosEvlpyIMXOBiKwan7NpTzvmk65eK1WIsWfCZMNyW8/OQTc2f70r+IgLw/Id8j7hgeDtS/yymEokjFnVDi5szHdyIQ2JjzEVhd+sb/7Uwp+v+2RwMqKLVRpUxiw1FGpmxpwcojPPng+UdaGsduFPlV4dTJ+txzVdN7deSU9DM/qcTkXrlnWC7o=\"\n})"
2022/05/19 12:35:37 WRN There was a problem monitoring for events error="none of the waiting queue events could be processed" event_type=SQS_TERMINATE
2022/05/19 12:35:39 INF only lifecycle termination events from ASG to SQS are supported outside EventBridge
2022/05/19 12:35:39 ERR error processing SQS message error="unsupported message type ({\n  Attributes: {\n    SentTimestamp: \"1652963523993\"\n  },\n  Body: \"{\\\"AccountId\\\":\\\"<REDACTED>\\\",\\\"RequestId\\\":\\\"e39c6ef2-c439-4103-8fb5-16a14e7b22a9\\\",\\\"AutoScalingGroupARN\\\":\\\"arn:aws:autoscaling:us-east-1:<REDACTED>:autoScalingGroup:e0cbd456-bc5e-4d76-a5c4-a247233d1547:autoScalingGroupName/eks-ecbce5fc-7e21-ef70-6ae4-0a1580d5d516\\\",\\\"AutoScalingGroupName\\\":\\\"eks-ecbce5fc-7e21-ef70-6ae4-0a1580d5d516\\\",\\\"Service\\\":\\\"AWS Auto Scaling\\\",\\\"Event\\\":\\\"autoscaling:TEST_NOTIFICATION\\\",\\\"Time\\\":\\\"2022-05-19T12:32:03.932Z\\\"}\",\n  MD5OfBody: \"c1ff1e6c85114a8a7fffdbfe1db9041e\",\n  MessageId: \"74f9991c-d280-4e85-847c-35ce988f9e7d\",\n  ReceiptHandle: \"AQEBWAWLP+03XJzymHDdD7ckTozTgTm1kERcVDvIXYgLqrnJN3Ufh9Aw1yT+Shhnm0EPkQsQ+Eq2bvUbEjr/U0wGVt24MQ4nh1oQpQhs41o55SyNz2CM7pcougLvPY5Jl70s1hIQjA146yP3dRAZ1NPDdsyhxbLDEsaKwJ1+eZ6tgg7wQk6sQeYLdmUFft3XVUHh8bUc7hjNViszy1Vnw9xE8kdwVBlIS2MREt26/NvMnPFwaU0Avf++nn+szNjxFUbHa/gEZmTWzOzq34Ms8jl6vxqGQ0inGWGrP6kgmoO+55Enta5aExMV+CJLQBfoii3l+0kvU3Jc+CDjSz5sQN3rCJNqZj2aOJc2KcbFHSg0PiTQS2JOdinFBSwmosnnxjbRh4OO21JpELjYbjzwL8aAiOPaKzZSwGDXC7mjxaBVZBE=\"\n})"
2022/05/19 12:35:39 WRN There was a problem monitoring for events error="none of the waiting queue events could be processed" event_type=SQS_TERMINATE
2022/05/19 12:35:39 WRN Stopping NTH - Duplicate Error Threshold hit.
panic: none of the waiting queue events could be processed

goroutine 66 [running]:
main.main.func3(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/node-termination-handler/cmd/node-termination-handler.go:210 +0x649
created by main.main
	/node-termination-handler/cmd/node-termination-handler.go:192 +0xd51

@zerthimon thanks for bringing this issue up. Judging from the logs you've provided, it is caused by an issue that we just released a fix for yesterday as part of NTH release v1.16.4. Can you please update your NTH installation to v1.16.4 and confirm if it resolves your issue?

@AustinSiu Yes it's now resolved

2022/05/19 14:57:02 WRN ignoring ASG test notification
2022/05/19 14:57:02 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:57:22 WRN ignoring ASG test notification
2022/05/19 14:57:22 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:57:42 WRN ignoring ASG test notification
2022/05/19 14:57:42 WRN skip processing SQS message error="message is a test notification"

Thanks!

Now I have these messages appearing in log every 20 seconds:

2022/05/19 14:57:02 WRN ignoring ASG test notification
2022/05/19 14:57:02 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:57:22 WRN ignoring ASG test notification
2022/05/19 14:57:22 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:57:42 WRN ignoring ASG test notification
2022/05/19 14:57:42 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:58:02 WRN ignoring ASG test notification
2022/05/19 14:58:02 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:58:22 WRN ignoring ASG test notification
2022/05/19 14:58:22 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:58:42 WRN ignoring ASG test notification
2022/05/19 14:58:42 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:59:02 WRN ignoring ASG test notification
2022/05/19 14:59:02 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:59:22 WRN ignoring ASG test notification
2022/05/19 14:59:22 WRN skip processing SQS message error="message is a test notification"
2022/05/19 14:59:42 WRN ignoring ASG test notification
2022/05/19 14:59:42 WRN skip processing SQS message error="message is a test notification"
2022/05/19 15:00:02 WRN ignoring ASG test notification
2022/05/19 15:00:02 WRN skip processing SQS message error="message is a test notification"
snay2 commented

The latest fix for this issue, #644, has been released with app version 1.16.5 and Helm chart version 0.18.5.