grafana/k8s-monitoring-helm

Correct memory requests/limits for alloy

davidroth opened this issue · 3 comments

I could not find any defaults or documentation for configuring the memory requests/limits for the Alloy pod.
I recently tried setting both to 400Mi, but I keep getting OOM kills.

So what are the memory requirements for the Alloy Agent Pod?
Is it possible to configure it so that it tries to GC before hitting the resource limit given by the Kubernetes resources.memory property?

@davidroth It depends on the type of pod

Prometheus metrics

As a rule of thumb, per each 1 million active series and with the default scrape interval, you can expect to use approximately:

  • 0.4 CPU cores
  • 11 GiB of memory
  • 1.5 MiB/s of total network bandwidth, send and receive

Loki logs

As a rule of thumb, per each 1 MiB/second of logs ingested, you can expect to use approximately:

  • 1 CPU core
  • 120 MiB of memory

Pyroscope profiles

As a rule of thumb, per each 100 profiles/second, you can expect to use approximately:

  • 1 CPU core
  • 10 GiB of memory

https://grafana.com/docs/alloy/latest/tasks/estimate-resource-usage/
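
As a rough illustration of how these rules of thumb scale, here is a minimal Python sketch. It assumes usage grows linearly with load, which the linked page only offers as an approximation. The function and class names are made up for this example and are not part of Alloy or the chart.

```python
from dataclasses import dataclass

MIB_PER_GIB = 1024


@dataclass(frozen=True)
class Estimate:
    """Approximate resource usage for one signal type (hypothetical helper)."""
    cpu_cores: float
    memory_mib: float


def metrics_estimate(active_series: float) -> Estimate:
    # Rule of thumb: 0.4 CPU cores and 11 GiB per 1 million active series.
    scale = active_series / 1_000_000
    return Estimate(0.4 * scale, 11 * MIB_PER_GIB * scale)


def logs_estimate(mib_per_second: float) -> Estimate:
    # Rule of thumb: 1 CPU core and 120 MiB per 1 MiB/second of logs.
    return Estimate(1.0 * mib_per_second, 120.0 * mib_per_second)


def profiles_estimate(profiles_per_second: float) -> Estimate:
    # Rule of thumb: 1 CPU core and 10 GiB per 100 profiles/second.
    scale = profiles_per_second / 100
    return Estimate(1.0 * scale, 10 * MIB_PER_GIB * scale)


if __name__ == "__main__":
    # Example inputs are arbitrary; replace them with your own numbers.
    print("metrics :", metrics_estimate(500_000))
    print("logs    :", logs_estimate(2))
    print("profiles:", profiles_estimate(50))
```

These numbers are only a starting point; actual usage depends on what else the pod is doing.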

@bentonam Thanks for the hint. Does this mean that if I have ~25K active series being reported in Grafana, I just divide 11Million/25K and can therefore set it to ~440MiB?

@davidroth It just depends, as it is not necessarily based on what you see in Grafana. It would be more or less per cluster per pod in the Alloy metrics StatefulSet (as clustering is supported).

To better understand what is being written, you can enable additional Alloy metrics from the chart by setting metrics.alloy.metricsTuning.useIntegrationAllowList: true, which will expose the prometheus_remote_storage* metrics.

Once you know how many active series there are for a given cluster, you could loosely apply the formula:

Estimated Memory = ((Active Series / 1 million) * 11 GiB) / # of Replicas
Estimated CPU = ((Active Series / 1 million) * 0.4 CPU) / # of Replicas
Requests Memory = Estimated Memory * 0.8
Requests CPU = Estimated CPU * 0.8
Limits Memory = Estimated Memory * (1.5 to 3)
Limits CPU = Estimated CPU * (1.5 to 3)
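
To make the formula concrete, here is a small Python sketch that applies it and prints values in the Mi / m notation Kubernetes accepts for resources. The function name, the default limit factor of 1.5 (the low end of the 1.5 to 3 range), and the choice to round up are assumptions made for this example, not something the chart does for you.

```python
import math


def estimate_resources(active_series, replicas=1,
                       request_factor=0.8, limit_factor=1.5):
    """Apply the loose formula above (hypothetical helper, not a chart API).

    limit_factor should be somewhere in the 1.5 to 3 range.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")

    scale = active_series / 1_000_000
    est_memory_mib = scale * 11 * 1024 / replicas  # 11 GiB per 1M series
    est_cpu_cores = scale * 0.4 / replicas         # 0.4 CPU per 1M series

    def mem(mib):
        # Round away float noise first, then round up to a whole Mi.
        return f"{math.ceil(round(mib, 6))}Mi"

    def cpu(cores):
        # Express CPU in millicores, rounded up.
        return f"{math.ceil(round(cores * 1000, 6))}m"

    return {
        "requests": {
            "cpu": cpu(est_cpu_cores * request_factor),
            "memory": mem(est_memory_mib * request_factor),
        },
        "limits": {
            "cpu": cpu(est_cpu_cores * limit_factor),
            "memory": mem(est_memory_mib * limit_factor),
        },
    }


if __name__ == "__main__":
    # ~25K active series on a single replica, as in the question above.
    print(estimate_resources(25_000, replicas=1))
```

With 25K series the base estimate is 25,000 / 1,000,000 * 11 GiB, or about 282 MiB, before the request and limit factors are applied. This covers metrics only.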

Please keep in mind that this is not a concrete calculation; it should get you in the general vicinity and serve as a starting point. You also have to consider what else the pod is doing. In this case, the chart deploys 4 types of Alloy Pods:

  • Alloy (StatefulSet): Runs 1 replica by default; responsible for scraping metrics and all things OTEL
  • Alloy Logs (DaemonSet): Handles collection of pod logs from each worker node
  • Alloy Events (Deployment): Retrieves K8s Events from the API and writes them as logs.
  • Alloy Profiles (DaemonSet): Responsible for Profiles

Based on this, if you are doing anything with OTEL, you also need to take into account that the pod is doing more than just metrics. It is also handling OTEL logs/traces, so you would need to adjust the resources accordingly.