aws/aws-node-termination-handler

"Metric leak" which results in restarts caused by out-of-memory

peterbueschel opened this issue · 3 comments

Describe the bug

The number of exposed metric samples for the aws-node-termination-handler increases continuously until it reaches the configured default sample_limit of 5000 (see the upstream helm chart). At that point Prometheus marks the target as unhealthy and stops scraping it.


At the same time, CPU and memory utilization also increase.

Memory usage grows until it hits the configured limit and the container is OOM-killed.

-> It looks like the actions_node time series are the ones affected.
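For context on why this grows without bound (this is not the NTH source, which emits the metric through the OpenTelemetry SDK, as the telemetry_sdk_* labels below show): a counter keyed by per-node labels such as node_name and node_event_id registers one time series per unique label combination, and those series stay on the /metrics endpoint until something explicitly deletes them. A minimal sketch of that mechanism using prometheus/client_golang, purely for illustration:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// actionsNode mimics the shape of the NTH "actions_node" metric:
// one label set (and therefore one time series) per handled node event.
var actionsNode = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "actions_node",
		Help: "Node actions performed (illustration only, not the NTH implementation).",
	},
	[]string{"node_action", "node_event_id", "node_name", "node_status"},
)

func main() {
	prometheus.MustRegister(actionsNode)

	// Every drained node adds a new series that is never removed again,
	// so the sample count on /metrics only ever goes up.
	actionsNode.WithLabelValues("cordon-and-drain", "asg-lifecycle-term-1", "node-a", "success").Inc()
	actionsNode.WithLabelValues("cordon-and-drain", "asg-lifecycle-term-2", "node-b", "success").Inc()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With node_event_id and node_name as label dimensions, the series count is unbounded in a cluster with node churn, which matches the growth shown above.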

Related

Similar behavior was reported in #665, but that ticket was closed without any action or fix.

Steps to reproduce
Deploy the NTH in a K8s cluster with node events: the number of time series on the metrics endpoint will keep increasing and never decrease. Entries for drained nodes that are no longer part of the cluster remain on the metrics page.

Example entry:
actions_node{node_action="cordon-and-drain",node_event_id="asg-lifecycle-term-6231<removed>",node_name="<removed>",node_status="success",service_name="unknown_service:node-termination-handler",telemetry_sdk_language="go",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="0.20.0"} 1
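A quick way to watch the series count grow is to count such entries on the exposed endpoint. A small helper sketch follows; the localhost:9092 address is only an assumption (use whatever host/port the NTH metrics endpoint is reachable or port-forwarded on):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumes the NTH metrics endpoint is reachable at localhost:9092,
	// e.g. via kubectl port-forward; adjust host and port as needed.
	resp, err := http.Get("http://localhost:9092/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	count := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		// Count exposed actions_node samples, skipping # HELP / # TYPE lines.
		if strings.HasPrefix(scanner.Text(), "actions_node{") {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("actions_node series currently exposed: %d\n", count)
}
```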

Expected outcome
Entries for drained nodes should be cleaned up from the metrics endpoint after some time.
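As a sketch of what such a cleanup could look like (hypothetical, continuing the client_golang illustration above rather than the OpenTelemetry SDK the handler actually uses): once a drained node has left the cluster, drop every series that carries its node_name label.

```go
// cleanupNodeSeries is a hypothetical helper, reusing the actionsNode
// CounterVec from the sketch above: once a drained node is gone from the
// cluster, remove its series so the exposed sample count can shrink again.
func cleanupNodeSeries(nodeName string) {
	// DeletePartialMatch (prometheus/client_golang >= v1.13) deletes all
	// series whose labels contain the given subset.
	actionsNode.DeletePartialMatch(prometheus.Labels{"node_name": nodeName})
}
```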

Environment

  • NTH App Version: v1.19.0
  • NTH Mode (IMDS/Queue processor): Queue processor
  • Kubernetes version: v1.26.9-eks
  • Installation method: helm chart
doryer commented

Same here; for now we are dropping the samples via the ServiceMonitor CRD.