Recommended way to alert on inability to assign IPs to Pods

Question

sidewinder12s opened this issue 9 months ago · 9 comments

What happened:

Is there a recommended way, using Kubernetes/CNI Helper/KSM/Node Exporter/cAdvisor metrics, to alert when the CNI is unable to allocate IPs, runs out of IPs, etc.?

We have had multiple incidents caused by the CNI either running out of IPs and being unable to allocate more from the subnet, or being unable to allocate IPs for unknown reasons. We have had trouble identifying ways to alert on this, because the only place I've seen explicit messages about what is wrong is the CNI logs.

We need to be able to alert on this. Because of a multitude of issues and required migrations, we cannot use the recommended mitigations of IP prefix assignment or IPv6.

Environment:

Kubernetes version (use kubectl version): 1.25
CNI Version: v1.13.3
OS (e.g: cat /etc/os-release): EKS AMI v20230825
Kernel (e.g. uname -a): 5.10.186-179.751.amzn2

Answer 1 · 2023-11-01T18:16:17.000Z

We have a containers roadmap item for this, aws/containers-roadmap#2011, and one of the things it covers is IP and subnet utilization.

Answer 2 · 2023-11-01T18:21:51.000Z

@sidewinder12s the CNI gets available IPs from IPAMD, so the reasons for not being able to allocate IPs are:

- IPAMD is not reachable (typically means aws-node pod is restarted while kubelet is still issuing CNI ADDs)
- No IPs are available, which could be caused by:
  - CNI waiting for new ENIs to be attached (we have capacity, and are waiting for ENIs to be attached to instance)
  - All IPs are in use, and CNI is waiting for IP addresses to come out of cooldown (previously in use)
  - No more ENIs can be attached and all available IPs are in use (instance hit ENI limits)
  - Subnet is exhausted
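
For illustration only, the cases above can be laid out as a small decision function. This is a sketch, not how IPAMD works internally: `NodeIpState` and every field on it are hypothetical names invented here, and the CNI does not expose this state under these names today.

```python
from dataclasses import dataclass


@dataclass
class NodeIpState:
    """Hypothetical per-node snapshot; all field names are illustrative assumptions."""
    ipamd_reachable: bool      # could the CNI plugin reach IPAMD at all?
    free_ips: int              # IPs assignable right now
    cooling_ips: int           # recently released IPs still in cooldown
    attached_enis: int         # ENIs currently attached to the instance
    max_enis: int              # ENI limit for the instance type
    eni_attach_pending: bool   # an ENI attach is in flight
    subnet_free_ips: int       # free addresses left in the subnet


def diagnose(state: NodeIpState) -> str:
    """Map a snapshot to one of the failure reasons listed in the comment above."""
    if not state.ipamd_reachable:
        return "ipamd_unreachable"
    if state.free_ips > 0:
        return "ok"
    if state.eni_attach_pending:
        return "waiting_for_eni_attach"
    if state.cooling_ips > 0:
        return "waiting_for_cooldown"
    if state.attached_enis >= state.max_enis:
        return "eni_limit_reached"
    if state.subnet_free_ips == 0:
        return "subnet_exhausted"
    # No free IPs, yet none of the listed reasons apply.
    return "unknown"
```

The final `unknown` branch is the awkward one for alerting: no IPs are assignable, but none of the listed causes explains why.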

As for metrics, we do not really delineate these cases today, but we do have a Prometheus metric, awscni_no_available_ip_addresses, that tracks every time the CNI gets a response from IPAMD saying that no IPs were available.
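
A minimal sketch of "something is wrong" alerting on that metric, assuming it is a monotonically increasing counter exposed in the standard Prometheus text exposition format. Only the metric name comes from this thread; the helper names `read_counter` and `should_alert` are made up here, and the label parsing is deliberately simplified.

```python
import re

METRIC = "awscni_no_available_ip_addresses"

# Simplified sample-line pattern: name, optional {labels}, value.
# Assumption: label values contain no "}" characters.
_SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def read_counter(exposition_text: str, name: str = METRIC) -> float:
    """Sum every sample of `name` found in Prometheus text exposition output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def should_alert(previous: float, current: float, threshold: float = 0.0) -> bool:
    """Return True when the counter grew by more than `threshold` between scrapes.

    A lower current value is treated as a counter reset (for example after a
    pod restart), in which case the current value is taken as the increase.
    """
    increase = current - previous if current >= previous else current
    return increase > threshold
```

In a Prometheus-based stack the same idea is usually written as a rule expression along the lines of `increase(awscni_no_available_ip_addresses[5m]) > 0`; the window and threshold there are assumptions to tune, not recommendations from the maintainers.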

Answer 3 · 2023-11-01T18:25:36.000Z

This ask is partially driven by this issue: #2650

We had plenty of IPs in the subnet; the CNI was just failing to acquire more IPs for some reason (the node appeared otherwise healthy, and IPAMD appeared to be functional as well).

awscni_no_available_ip_addresses appears to be brand new, and I think it may work for giving us basic "something is wrong" alerting.

Answer 4 · 2024-01-01T00:03:47.000Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

Answer 5 · 2024-01-02T23:25:24.000Z

not stale.

Answer 6 · 2024-01-02T23:27:22.000Z

Design for this is currently in progress, i.e. exposing per-node IPAM metrics

Answer 7 · 2024-02-09T15:28:25.000Z

@sidewinder12s working through these monitoring aspects as well so wanted to share, but this might be out of context based on your environment...

reading over prometheusmetrics.go, the prom support is still limited in latest release. only gauges are exported. that does include a number of IP/ENI-related metrics, but does not include several that relate to IPAM. this is probably something us/the community can help improve, i've been trying to get a minimal monitoring setup going and learning enough along the way to understand the technical reasons for current state.

until that is addressed (devs will keep me honest here), i think the best way is running cni-metrics-helper in cloudwatch mode. then you could configure prometheus cloudwatch-exporter to pull/alert on all or only the subset of metrics you care about.

if you have a prometheus based monitoring stack this works well. if you rely on another agent to scrape prometheus metrics, it is not as helpful... though you could create a shim that pulls from cloudwatch and pushes custom metrics to your monitoring provider (i've done this with datadog in the past).
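
A rough sketch of such a shim, under stated assumptions: the `cloudwatch` argument is expected to behave like a boto3 CloudWatch client (only `get_metric_statistics` is called), `push` is a hypothetical callback standing in for whatever your monitoring provider's client offers, and the namespace, metric names, and dimensions are placeholders you would need to check against what cni-metrics-helper actually publishes in your account.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional, Sequence


def latest_value(cloudwatch, namespace: str, metric_name: str,
                 dimensions: Sequence[dict], window_minutes: int = 10) -> Optional[float]:
    """Return the most recent per-minute maximum for one metric, or None if no data."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=list(dimensions),
        StartTime=end - timedelta(minutes=window_minutes),
        EndTime=end,
        Period=60,
        Statistics=["Maximum"],
    )
    # Datapoints are not guaranteed to arrive in time order, so sort first.
    points = sorted(response.get("Datapoints", []), key=lambda p: p["Timestamp"])
    return points[-1]["Maximum"] if points else None


def forward(cloudwatch, push: Callable[[str, float], None], namespace: str,
            metric_names: Sequence[str], dimensions: Sequence[dict]) -> int:
    """Pull each metric from CloudWatch and hand it to `push`; return how many were sent."""
    forwarded = 0
    for name in metric_names:
        value = latest_value(cloudwatch, namespace, name, dimensions)
        if value is not None:
            push(name, value)
            forwarded += 1
    return forwarded
```

With boto3 installed, the client would come from `boto3.client("cloudwatch")`, and `forward` could run on a schedule. Passing the client and the push callback in as arguments keeps the shim independent of any one monitoring provider.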

sorry i don't have a more direct answer today, these are just top of mind right now as i am working through the same challenge.

Answer 8 · 2024-04-10T00:03:18.000Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

Answer 9 · 2024-04-24T00:03:20.000Z

Issue closed due to inactivity.