"Metric leak" which results in restarts caused by out-of-memory
peterbueschel opened this issue · 3 comments
Describe the bug
The number of exposed metric samples for the aws-node-termination-handler increases continuously until it hits the configured default sample_limit
of 5000 (see the upstream Helm chart). At that point, Prometheus marks the target as unhealthy and stops scraping it.
At the same time, CPU and memory utilization also increase:
memory usage grows up to the configured limit and the container eventually gets an OOM kill.
-> It looks like the actions_node metric is the one whose timeseries are affected.
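For context, if the scrape is configured through a Prometheus Operator ServiceMonitor, the 5000-sample ceiling corresponds to the `spec.sampleLimit` field. A minimal sketch of such an object follows; the object name, namespace, labels, and port are placeholders, not necessarily what the NTH Helm chart generates:

```yaml
# Illustrative ServiceMonitor only; names, labels, and the port are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-node-termination-handler
  namespace: monitoring
spec:
  # When a single scrape returns more samples than this, Prometheus rejects
  # the whole scrape and marks the target as unhealthy.
  sampleLimit: 5000
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-node-termination-handler
  endpoints:
    - port: metrics
      interval: 30s
```

Raising the limit only delays the OOM, since the actions_node series set keeps growing.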
Relevant
Similar behavior was reported in #665, but that ticket was closed without any action or fix.
Steps to reproduce
Deploy the NTH in a K8s cluster that produces node events; the number of timeseries exposed on the metrics endpoint will increase but never decrease. The entries for drained nodes that are no longer part of the cluster remain on the metrics page.
Example entry:
actions_node{node_action="cordon-and-drain",node_event_id="asg-lifecycle-term-6231<removed>",node_name="<removed>",node_status="success",service_name="unknown_service:node-termination-handler",telemetry_sdk_language="go",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="0.20.0"} 1
Expected outcome
The entries for drained nodes are cleaned up from the metrics endpoint after some time.
Environment
- NTH App Version: v1.19.0
- NTH Mode (IMDS/Queue processor): Queue processor
- Kubernetes version: v1.26.9-eks
- Installation method: helm chart
Same here; for now we are dropping those samples via the ServiceMonitor CRD.
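For anyone needing the same stopgap, here is a sketch of what dropping those samples in the ServiceMonitor could look like (Prometheus Operator CRD; selector, namespace, and port names are again placeholders):

```yaml
# Sketch of a ServiceMonitor that drops the leaking series at scrape time.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-node-termination-handler
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-node-termination-handler
  endpoints:
    - port: metrics
      metricRelabelings:
        # Drop every actions_node sample before ingestion so the per-node-event
        # series cannot pile up toward the sample_limit.
        - sourceLabels: [__name__]
          regex: actions_node
          action: drop
```

This hides the metric entirely, so it is only a workaround until the stale series are cleaned up in NTH itself.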