prometheus-community/stackdriver_exporter

Handle DELTA metrics better

glightfoot opened this issue · 17 comments

Right now, there are several issues stemming from stack driver's DELTA metrics, which just record the change in a metric. This leads to inconsistent data in prometheus, since the value reported by the exporter is a gauge, which is only set to the value of the change. Ideally, we'd update a gauge with the value of the delta change to reduce the impedance mismatch between stackdriver and prometheus. This is most noticeable with infrequently updated metrics, which produce a single gauge data point in prometheus once in a while, with huge gaps between them.

It looks like DELTA metrics used to be handled as counters, but that was changed with #15 since DELTAs can go down. There probably just needs to be some different handling of how we set the gauge. If it's a GAUGE type from stackdriver, we set the value of the prometheus gauge to the value from stackdriver. If it's a DELTA type from stackdriver, we add the value of the stackdriver delta to the value of the prometheus gauge.

That sounds like a reasonable idea. We would need to generate a cache pool for delas. We would also need to have some way to figure out if a metric stops so we can remove items from the cache. It's tricky to do, since I don't think Stackdriver has any concept of staleness.

We could reuse the expiring registry from the statsd_exporter? If we get a delta, pull a gauge from the registry and add the value to it. I'm not familiar enough with the current collector operation to know how to integrate the two, though.

After thinking about this some more, this won't be that straightforward. In the case of counters represented by deltas, that's easy since we don't really care about the absolute value of the metric, but for gauges represented by deltas, it would be meaningless without having some way of reconciling the value on the exporter end with the true value in stackdriver. I don't know how many of those cases there are (metrics from stackdriver that behave as gauges but are represented by deltas), but it's a case that wouldn't make sense to return data for...

Is there a stackdriver api method to get the value of a delta? might be worth the extra api call to get accurate data

Whats the latest thinking on this, does it seem feasible? I'm happy to help with an implementation if so.

It seems to me this is quite important -- you cant calculate rate() of GAUGE metrics in prometheus properly, which nullifies a lot of the usefulness of importing many stackdriver DELTA metrics (for which which rate() works fine on the stackdriver side)

If it's a DELTA type from stackdriver, we add the value of the stackdriver delta to the value of the prometheus gauge.

I think it should be just switched back to counter in Prometheus, and stackdriver-exporter needs to accumulate counts in memory to produce monotonically increasing Prometheus counters.

For example, https/backend_request_count is currently converted to gauge, which is obviously meant to be a counter, not a gauge (see documentation).

The same issue applies to histograms (DISTRIBUTION + DELTA, e.g. https/backend_latencies): each bucket is a counter, and most GCP distributions report them as deltas. stackdriver-exporter converts it into a Prometheus histogram, but its bucket values contain deltas rather than monotonically increasing total count (as required by Prometheus).
As a result, histogram expressions such as histogram_quantile(0.99, sum(rate(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_latencies_bucket[5m])) by (le)) produce incorrect graphs.

We're encountering a similar issue here with the GCP composer's run_count metric, however, the behavior is a bit weirder.

The datapoint in Cloud Monitoring is 1, however, when exported to Prometheus, it was showing as 4.
According to the current implementation of how it's directly taking the Delta value and converting it to a Gauge with the same value, it should also be 1 regardless of previous values.

@nilebox That seems like a reasonable option to me. One thing I've been looking at is how similar things happen in the fastly-exporter. It's receiving a series of deltas and adding them directly as "normal" metrics in a registry.

The down side to this is that I'm seeing jitter in the time between the return of the delta value. This leads to some deltas arriving a fraction of a second too late for the scrape. It's probably better to accumulate in a custom structure, that includes the timestamp of "settled time" of that delta value, which is exposed at scrape time. This way any delays in API calls don't leak into Prometheus.

Any updates on this? This makes the exporter nearly unusable in many cases.

My recommendation is the exporter should integrate the delta values to produce a counter.

In other words, hold an internal counter, add the delta to it each time you fetch it, and export the counter.
The value will drop to 0 each time the exporter process restarts, but Prometheus can cope with this as long as you don't do it too often.

(EDIT: just noticed this was said a year ago #116 (comment))

I had started a naive implementation here. I believe in order to produce an accurate value I will need to update the collector to consider time since last scrape as a part of calculating the start/end time and incrementing the counter for all data points found when querying for metric descriptors with kind = delta. Does that sound correct?

I didn’t understand the bit about time.
The value is never “accurate”; it is sampled as-at some time in the past.

Let me clarify, you are correct the value will never truly be accurate, I'm trying to get the value incremented from scrape to scrape to be as accurate as reasonably possible. This might be over complicating the implementation though.

Example, scraping the exporter every 5 minutes for the apigateway proxy/request_count metric which is sampled by GCP every 1 minute. I would expect there to be 5 data points from GCP which need to increment the counter to produce an accurate increase between scrapes. The current logic uses the configuration values monitoring.metrics-interval and monitoring.metrics-offset to set the start/end time. If monitoring.metrics-interval is set to 1m then the request to GCP will only likely see the latest sample and produce an inaccurate value. Similarly, if monitoring.metrics-interval is set to 5m but a scrape is missed the value produced for the next scrape will also likely be inaccurate. Using the time since last scrape (as long as it's within a reasonable window) will produce a much more accurate value.

I should note I've never looked at this code before.
I think you are talking about the start and end times here:

endTime := time.Now().UTC().Add(c.metricsOffset * -1)
startTime := endTime.Add(c.metricsInterval * -1)

What I would expect is we remember the last time we called the API, so that's the start, and end is "now", then we get however many points are new, and we add them all up.

I also have a working implementation although I still need to write unit tests and it needs some minor cleanup from us debugging.

This handles the scrape timing stuff correctly but in a pretty simple way...I think.

I'm at Mailchimp where the issue was first opened by Greg (and @SpencerMalone re-queried about it). This seems to meet our needs and would love to get it (or a version of this) upstreamed!

Of note -- this does not currently explicitly handle DISTRIBUTION DELTAs -- I missed the reference to this above when I put this together. I'll take another gander at the code, but I feel like this solution can be amended to handle that also. Mentioned here #25

I also have a working implementation although I still need to write unit tests and it needs some minor cleanup from us debugging.

This handles the scrape timing stuff correctly but in a pretty simple way I think.

This is awesome especially if you have feedback from running it live! I do wonder about a few things based on the slightly different approach we both took,

  1. Are there delta metrics you have seen which should be treated as a gauge? It's been a bit since I looked closely but it seemed as though all were better represented as counters.
  2. Any thoughts on the conversation above related to modifying the startTime for calling the TimeSeries API and incrementing for every data point found based on previous scrape time?
  3. I have noticed at some times that GCP does not always export a delta metric if the value doesn't change. This can cause a staleness issue as the series appears to no longer be reporting. This caused me to choose to export all delta metrics which have had a count change within a TTL window.

By no means am I an expert on GCP metrics just comparing our design decisions.

This is awesome especially if you have feedback from running it live! I do wonder about a few things based on the slightly different approach we both took,...

I swear I replied to the above a few weeks ago....but apparently I did not. I had a laptop switch then spent 2 weeks doing trail work in Maine, so maybe the reply died in an input box somewhere. So lemme try again:

  1. Great point -- upon further inspection we decided to just add a global --aggregate-deltas switch that causes us to treat all DELTAs as things that should be aggregated and have aggregated values exported.
  2. I'm not sure about this TBH. In our case we're scaping at least as often as new values are available in stackdriver, so we really only need to make sure we don't double/triple/etc. count them. I'm sure there is nuance here I don't really understand though.
  3. This is something we haven't directly noticed although we are still in early days having working metrics so it's possible we would see this issue eventually. Seems like a reasonable solution in that case if we don't want to consider an unchanged value as stale.

I am also not an expert (or well even a journeyman) on GCP metrics or prometheus so I'm happy to have input and interest!

Since the last update we are now handling DISTRIBUTION DELTAs similarly (and have fixed a few bugs). I went ahead and opened a PR with the changes we are currently running in our prod environment. We're still in week 1 with this code, but will update if we notice issues.

Since the last update we are now handling DISTRIBUTION DELTAs similarly (and have fixed a few bugs). I went ahead and opened a PR with the changes we are currently running in our prod environment. We're still in week 1 with this code, but will update if we notice issues.

Awesome work! Related to the points I originally brought up,

  1. This is probably something that can be accounted for later/ignored if the need doesn't come up. I imagine most people scrape at least as often as data is available or should be setting their monitoring.metrics-interval accordingly if not edit: realized this could be a problem even if monitoring.metrics-interval is set appropriately it's tracking only the latest value vs all new values discovered. I think my original statement of, fix it later, still stands though.
  2. I found some of my old notes around this. I have example graphs which are rendering as disconnected points which I assumed was a staleness issue. But now I'm wondering if this might actually be an artifact of calculating the rate on a gauge and will test it out with the counter implementation this week.