temporalio/temporal

[Feature Request] Support exponential/native histograms in Temporal Server/SDKs

Is your feature request related to a problem? Please describe.

Temporal Server is fairly expensive to monitor in self-hosted environments because of the volume of metric series it generates. Observability platforms such as AWS CloudWatch Metrics and Grafana Cloud charge per active metric series, so the costs add up quickly.

Prometheus has experimental support for native histograms, and stability is improving daily. One of the main advantages of native histograms over Prometheus' classic histograms is that they can store the same data with fewer metric series and higher accuracy/resolution.
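
To illustrate where the extra resolution comes from: both Prometheus native histograms (via their `schema`) and OpenTelemetry exponential histograms (via their `scale`) derive bucket boundaries from a single growth factor, `2^(2^-scale)`, instead of a hand-picked list. The sketch below only reproduces that arithmetic; the function names are hypothetical and nothing here is taken from Temporal's code.

```go
package main

import (
	"fmt"
	"math"
)

// base returns the growth factor between consecutive bucket boundaries
// for a given scale (OTel) or schema (Prometheus): 2^(2^-scale).
func base(scale int) float64 {
	return math.Pow(2, math.Pow(2, float64(-scale)))
}

// bucketsPerDoubling returns how many buckets cover each power-of-two
// range of values. Only meaningful for scale >= 0 in this sketch.
func bucketsPerDoubling(scale int) int {
	return 1 << uint(scale)
}

func main() {
	// The scales listed here are arbitrary illustration values.
	for _, s := range []int{0, 3, 5} {
		fmt.Printf("scale=%d base=%.4f buckets per doubling=%d\n",
			s, base(s), bucketsPerDoubling(s))
	}
}
```

A higher scale means a growth factor closer to 1, so buckets are narrower and the relative error of quantile estimates shrinks, without anyone having to choose boundaries up front.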

In the YouTube video "Prometheus Native Histograms in Production - Björn Rabenstein, Grafana Labs", the presenter states at 17:30: "bottom line is you get 10x the resolution at half the price". The accompanying infographic also shows the number of series in his example as ~16k for classic histograms compared to ~1k for native histograms. The reduction comes from needing only a single series to store the whole histogram (for a given set of labels).
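
As a rough sketch of that series arithmetic: a classic histogram exposes one `_bucket` series per explicit boundary, one for `le="+Inf"`, plus `_sum` and `_count`, all multiplied by the number of label combinations. The bucket and label-set counts below are hypothetical, chosen only to land in the same order of magnitude as the figures quoted above. They are not taken from the talk or from Temporal's metrics.

```go
package main

import "fmt"

// classicSeriesCount returns the number of series a classic Prometheus
// histogram produces: one _bucket series per explicit boundary, one for
// le="+Inf", plus _sum and _count, for every label combination.
func classicSeriesCount(explicitBuckets, labelSets int) int {
	return (explicitBuckets + 1 + 2) * labelSets
}

// nativeSeriesCount returns the number of series for a native histogram:
// a single series per label combination holds the whole histogram.
func nativeSeriesCount(labelSets int) int {
	return labelSets
}

func main() {
	// Hypothetical values for illustration only.
	const explicitBuckets, labelSets = 13, 1000

	fmt.Println("classic:", classicSeriesCount(explicitBuckets, labelSets)) // classic: 16000
	fmt.Println("native: ", nativeSeriesCount(labelSets))                   // native:  1000
}
```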

For teams deploying a new Temporal installation, having the option to export exponential histograms would be great. It would save costs, and since we don't have extensive dashboards, alerting, or SRE practices based on the old metric names, we could quickly build these out on the new native histograms.

Describe alternatives you've considered

I'm using Grafana Alloy to scrape the Temporal Server metrics. When I enable the option to scrape native histograms, none are scraped. The SDK metrics emitted using the OTel config are emitted as classic histograms. I believe this needs to be addressed in the Server/SDK code.
