seznam/slo-exporter

[BUG] incorrect increase calculation in prometheus_ingester

Closed this issue · 3 comments

Describe the bug

I have hit a strange situation while tinkering with slo-exporter for the first time. I suspect processMatrixResultAsIncrease calculates Prometheus metric increases incorrectly when the query interval is shorter than the interval at which Prometheus updates the data. I have been manually replaying the queries reported by the debug-level log, and most of the time Prometheus returns no result. Meanwhile I run a loop that hits the foo service with several requests each second. Strangely, I see no events emitted even when Prometheus finally updates the values.

I suspect this assignment erases any previously collected samples whenever Prometheus answers with no results. When the ingester finally collects something, there is nothing to compare it to. I suppose the correct way to update previousResult would be to merge the new result into it, e.g.:

func (r *queryResult) update(newResult queryResult) {
	r.timestamp = newResult.timestamp

	// Merge the new samples into the existing map instead of replacing it,
	// so an empty query result does not wipe the previously collected baseline.
	for k, v := range newResult.metrics {
		r.metrics[k] = v
	}
}
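
For illustration, here is a minimal sketch (hypothetical names and types, not the actual slo-exporter code) of how a per-series counter increase could then be computed against the merged previous result; with the merge in place, a series that reappears after a few empty query windows still has a baseline to diff against:

// Hypothetical sketch only: compute per-series increases between the merged
// previous result and the newly ingested one. It illustrates why a baseline
// must survive empty query windows.
func increases(previous, current map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(current))
	for series, value := range current {
		prev, seen := previous[series]
		if !seen {
			// No baseline yet, so there is no increase to report.
			continue
		}
		inc := value - prev
		if inc < 0 {
			// Counter reset: use the raw value as the increase,
			// similarly to how Prometheus handles resets.
			inc = value
		}
		out[series] = inc
	}
	return out
}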

How to reproduce the bug

I run the prometheusIngester with these settings:

modules:
  prometheusIngester:
    apiUrl: "http://127.0.0.1:9090"
    queryTimeout: 30s
    queries:
      - type: counter_increase
        query: 'istio_requests_total{destination_canonical_service="foo"}'
        interval: 5s

Then I saw a never-ending stream of messages like this:

time="2021-10-06T18:02:42.269+02:00" level=debug msg="executed query" component=prometheusIngester duration=7.28613ms query="istio_requests_total{destination_canonical_service=\"foo\"}[5s]" timestamp="2021-10-06 18:02:42.261701125 +0200 CEST m=+180.026099376"

Expected behavior

I expect messages from sloEventProducer etc.

Additional context

None

Hi, first of all, sorry for the huge lag 😖

Regarding the issue: I remember that I wanted to document that the query interval should not be shorter than the Prometheus scrape interval, but I apparently forgot to do it.

I realize that with this restriction, the delay between the actual time of an event and the time slo_exporter evaluates it can grow up to twice the scrape interval (e.g. up to two minutes with a 1m scrape interval), which gets worse with longer scrape intervals.

So one solution would be simply to document this; the other, as you suggest, would be to merge the results.
The second option would need some additional logic to expire old samples, similar to Prometheus staleness handling.
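
For illustration, a rough sketch of what such a merge with staleness expiry might look like (hypothetical types and names, not the code from #80):

import "time"

// Samples that have not been seen for longer than this are dropped,
// loosely mirroring Prometheus' 5-minute staleness handling.
const stalenessWindow = 5 * time.Minute

type sample struct {
	value    float64
	lastSeen time.Time
}

type queryResult struct {
	timestamp time.Time
	metrics   map[string]sample
}

func (r *queryResult) update(newResult queryResult) {
	r.timestamp = newResult.timestamp

	// Merge in the freshly returned samples instead of replacing the map.
	for k, v := range newResult.metrics {
		r.metrics[k] = v
	}

	// Expire samples that have not appeared in any result for too long.
	for k, v := range r.metrics {
		if newResult.timestamp.Sub(v.lastSeen) > stalenessWindow {
			delete(r.metrics, k)
		}
	}
}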

I took a stab at implementing the merging with "staleness" support; the draft is here: #80

@RadekDvorak so the "fix" is already in master; we will release a new version soon.

If you find the time to try it out, we would be happy to hear if it works as expected!

Hi @FUSAKLA,

I have been unable to reproduce the issue with v6.11.0 so far.

Thanks for the fix 👍