phsmith/rundeck_exporter

Wrong execution count for a time period

Closed this issue · 4 comments

Hello !

We’ve just deployed the exporter on our Rundeck and we have an issue when we try to set up an alert.

We want to raise an alert when a given job (scheduled to be triggered once an hour) fails 2 or more times in the last 3 hours.

We thought we had the right query (taken from this issue):

sum by (job_name) (max_over_time(rundeck_project_execution_status{job_name="[HIDDEN] Test Rundeck Nodes",status="failed"}[3h])) >= 2

But the alert keeps firing even though the 2 failed executions happened more than 6 hours ago:
image

The issue is more obvious when we try to get the number of executions (regardless of status) in the last 3 hours with this query:

sum by (job_name) (max_over_time(rundeck_project_execution_status{job_name="[HIDDEN] Test Rundeck Nodes"}[3h]))

image
We get a total of 17 when it should be 3.
Here is the screenshot taken from the Rundeck Activity tab of the job:
image

We don’t know how to solve this problem. Can you help us?

Hey @bcruvelier! Thanks for using the rundeck_exporter!

I see what you mean, but I wasn't able to reproduce this behavior. See the screenshots below:

For a longer period, I got 5 failed executions in both Rundeck and Prometheus:
Screenshot_20230329_190009

Screenshot_20230329_190017

For a slightly shorter period, I got only 2 failed executions in both Rundeck and Prometheus:

Screenshot_20230329_185846

Screenshot_20230329_185905

Thanks for your quick response.

Do you have suggestions on how I can debug it ?

I'm out of ideas at this point.

Edit:
Looking at the Grafana JSON example, I found this query, which seems to give me the right value (although I don't understand the .0165483472080457 part):
sum(increase(rundeck_project_execution_status{job_name="[HIDDEN] Test Rundeck Nodes"}[3h])) by (project_name, job_name)
image

increase() seems to be a good option in your case. It returns a float because it extrapolates the observed growth of the metric to cover the whole range window, so the result is usually not a whole number.

If you want to get an integer value, you can use the floor() function like this:

floor(sum(increase(rundeck_project_execution_status{job_name="[HIDDEN] Test Rundeck Nodes"}[3h])) by (project_name, job_name))
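To illustrate where a fractional part like .0165 can come from, here is a rough Python sketch of the extrapolation that increase() performs. It is a simplified model, not Prometheus's actual implementation: it ignores counter resets and the boundary heuristics, and it assumes the metric behaves like a counter. The 60-second scrape interval, the starting value of 100, and the execution times are all made-up values for illustration.

```python
import math


def approx_increase(samples, window_seconds):
    """Simplified model of PromQL increase() over one range window.

    samples: list of (timestamp_seconds, counter_value), sorted by time,
    all inside the window. Counter resets are ignored, and the observed
    growth is scaled up to the full window length.
    """
    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]
    raw_delta = last_v - first_v          # growth actually observed
    sampled_seconds = last_t - first_t    # time covered by the samples
    return raw_delta * window_seconds / sampled_seconds


SCRAPE_INTERVAL = 60                  # assumed scrape interval, in seconds
WINDOW = 3 * 3600                     # the [3h] range from the query
EXECUTION_TIMES = [1800, 5400, 9000]  # hypothetical hourly executions
START_VALUE = 100                     # hypothetical counter value at window start

# One sample per scrape; the first scrape lands 30 s into the window.
samples = []
for k in range(WINDOW // SCRAPE_INTERVAL):
    t = 30 + k * SCRAPE_INTERVAL
    finished = sum(1 for e in EXECUTION_TIMES if e <= t)
    samples.append((t, START_VALUE + finished))

estimate = approx_increase(samples, WINDOW)
print(estimate)              # slightly above 3
print(math.floor(estimate))  # 3
```

With these assumed numbers the samples cover 10740 of the 10800 seconds in the window, so the observed increase of 3 is scaled by 10800/10740, which gives roughly 3.0168. Flooring the result brings it back to 3.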

Hey @bcruvelier , I'm closing the issue, but you can reopen it if the problem wasn't solved. 👍