"Metric leak" which results in restarts caused by out-of-memory
peterbueschel opened this issue · 3 comments
Describe the bug
The number of exposed metric samples for the aws-node-termination-handler increases continuously until it hits the configured default sample_limit
of 5000 (see the upstream Helm chart). At that point, Prometheus marks the target as unhealthy and stops scraping it.
At the same time, CPU and memory utilization also increase:
memory usage grows up to the configured limit and the container eventually gets an OOM kill.
-> It looks like the actions_node metric is the one whose timeseries are affected.
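For context, if the scrape is configured through a Prometheus Operator ServiceMonitor, the 5000-sample ceiling corresponds to the `spec.sampleLimit` field. A minimal sketch of such an object follows; the object name, namespace, labels, and port are placeholders, not necessarily what the NTH Helm chart generates:

```yaml
# Illustrative ServiceMonitor only; names, labels, and the port are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-node-termination-handler
  namespace: monitoring
spec:
  # When a single scrape returns more samples than this, Prometheus rejects
  # the whole scrape and marks the target as unhealthy.
  sampleLimit: 5000
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-node-termination-handler
  endpoints:
    - port: metrics
      interval: 30s
```

Raising the limit only delays the OOM, since the actions_node series set keeps growing.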
Relevant
Similar behavior was reported in #665, but that ticket was closed without any action or fix.
Steps to reproduce
Deploy the NTH in a K8s cluster that produces node events; the number of timeseries exposed on the metrics endpoint will increase but never decrease. The entries for drained nodes that are no longer part of the cluster remain on the metrics page.
Example entry:
actions_node{node_action="cordon-and-drain",node_event_id="asg-lifecycle-term-6231<removed>",node_name="<removed>",node_status="success",service_name="unknown_service:node-termination-handler",telemetry_sdk_language="go",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="0.20.0"} 1
Expected outcome
The entries for drained nodes are cleaned up from the metrics endpoint after some time.
Environment
- NTH App Version: v1.19.0
- NTH Mode (IMDS/Queue processor): Queue processor
- Kubernetes version: v1.26.9-eks
- Installation method: helm chart
Same here; for now we are dropping those samples via the ServiceMonitor CRD.
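For anyone needing the same stopgap, here is a sketch of what dropping those samples in the ServiceMonitor could look like (Prometheus Operator CRD; selector, namespace, and port names are again placeholders):

```yaml
# Sketch of a ServiceMonitor that drops the leaking series at scrape time.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-node-termination-handler
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-node-termination-handler
  endpoints:
    - port: metrics
      metricRelabelings:
        # Drop every actions_node sample before ingestion so the per-node-event
        # series cannot pile up toward the sample_limit.
        - sourceLabels: [__name__]
          regex: actions_node
          action: drop
```

This hides the metric entirely, so it is only a workaround until the stale series are cleaned up in NTH itself.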