Inconsistent alerts triggered by the Prometheus alert manager

Question

varampati6 opened this issue 9 months ago · 0 comments

Host operating system: output of `uname -a`

blackbox_exporter version: output of `blackbox_exporter --version`

blackbox_exporter, version 0.20.0 (branch: HEAD, revision: 91372eb)
build user: root@d6d8976bddf4
build date: 20220316-17:42:45
go version: go1.17.8
platform: linux/amd64

What is the blackbox.yml module config.

modules:
  https_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.0", "HTTP/1.1", "HTTP/2.0"]
      method: GET
      preferred_ip_protocol: "ip4"
      valid_status_codes: [200, 403, 404, 502] # An empty list defaults to 2xx
      fail_if_ssl: false
      fail_if_not_ssl: true
      tls_config:
        insecure_skip_verify: true

What is the prometheus.yml scrape config.

global:
  scrape_interval: 60s
  evaluation_interval: 10m
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: '/probe'
    params:
      module:
        - https_2xx
    static_configs:
      - targets:
          - 'https://in-jira.domain.mycompany.com'
          - 'https://in-jenkins.domain.mycompany.com'
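Note: the scrape config as posted shows no `relabel_configs` section. It may simply have been left out of the paste. For comparison, the example in the blackbox_exporter README relabels each target into the `target` URL parameter and points the scrape address at the exporter itself. The following is a minimal sketch along those lines, not the reporter's actual config; the exporter address `127.0.0.1:9115` is an assumption (9115 is the exporter's default port, the host is a placeholder).

```yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: '/probe'
    params:
      module:
        - https_2xx
    static_configs:
      - targets:
          - 'https://in-jira.domain.mycompany.com'
          - 'https://in-jenkins.domain.mycompany.com'
    relabel_configs:
      # Pass the listed target to the exporter as ?target=<url>
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the blackbox exporter itself (assumed address)
      - target_label: __address__
        replacement: 127.0.0.1:9115
```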

What logging output did you get from adding `&debug=true` to the probe URL?

What did you do that produced an error?

I used the configuration shown above.

What did you expect to see?

The Prometheus query `probe_success{job="blackbox"} == 0` should return only the applications that have been down for the 10-minute interval specified in the configuration above. In my case, that is the Jira application, which is down.

What did you see instead?

Instead of raising an alert from a single consistent result, `probe_success{job="blackbox"} == 0` returns inconsistent results. For instance, the Jira application may be reported as down for only a couple of minutes, and sometimes the Jenkins application is reported as down even though it is up when I run the `probe_success` query.
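
Note: one possible reading of this behaviour is that `evaluation_interval: 10m` only sets how often rules are evaluated; it does not by itself require a condition to hold for 10 minutes. An instant query such as `probe_success{job="blackbox"} == 0` reflects the most recent probe result for each target, so a single failed probe can appear and disappear between runs. A minimal sketch of an alerting rule that requires the failure to persist uses the `for` clause. The group name, alert name, severity label, and annotation text below are hypothetical, and the rule file would need to be listed under `rule_files` in prometheus.yml.

```yaml
groups:
  - name: blackbox            # hypothetical group name
    rules:
      - alert: ProbeDown      # hypothetical alert name
        expr: probe_success{job="blackbox"} == 0
        # Fire only after the expression has been true at every
        # evaluation over a 10-minute span
        for: 10m
        labels:
          severity: critical  # hypothetical label
        annotations:
          summary: "Probe failing for {{ $labels.instance }}"
```

With a 10-minute evaluation interval the rule only looks at the state every 10 minutes, so a shorter `evaluation_interval` may be worth trying alongside `for: 10m`.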
