/pat

Prometheus Alert Testing utility

 ____   _  _____ 
|  _ \ / \|_   _|
| |_) / _ \ | |  
|  __/ ___ \| |  
|_| /_/   \_\_|  
                 

Prometheus Alert Testing tool

You may also be interested in PromCLI

Build & Install

go get github.com/kevinjqiu/pat

You must have Go 1.9+ and dep installed.

Build from source

Check out this repo to $GOPATH/src/github.com/kevinjqiu/pat

and then:

cd $GOPATH/src/github.com/kevinjqiu/pat && make build

Usage

pat [options] <test_yaml_file_glob>

e.g.,

pat test/*.yaml

Test File Format

Test files are written in YAML. For a complete schema definition (in JSON Schema format), see here.

Top level attributes

  • name - The name of the test case
  • rules - The rule definitions that are under test
  • fixtures - The fixture setup for the tests
  • assertions - The test assertions
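
As a rough picture of how these attributes fit together, here is a minimal Go sketch. The type names, field names, and struct tags are hypothetical and are not taken from pat's source. The nested attribute names come from the sections that follow. The tags are only annotations here, since no YAML library is imported.

```go
package main

import "fmt"

// Hypothetical types mirroring the documented attributes.
// pat's real definitions may differ.
type Rules struct {
	FromFile    string `yaml:"fromFile,omitempty"`
	FromLiteral string `yaml:"fromLiteral,omitempty"`
}

type Fixture struct {
	Duration string   `yaml:"duration"`
	Metrics  []string `yaml:"metrics"`
}

type Assertion struct {
	At       string              `yaml:"at"`
	Expected []map[string]string `yaml:"expected"`
	Comment  string              `yaml:"comment,omitempty"`
}

type TestCase struct {
	Name       string      `yaml:"name"`
	Rules      Rules       `yaml:"rules"`
	Fixtures   []Fixture   `yaml:"fixtures"`
	Assertions []Assertion `yaml:"assertions"`
}

// exampleTestCase builds a small test case by hand to show the nesting.
func exampleTestCase() TestCase {
	return TestCase{
		Name:  "Test HTTP Requests too low alert",
		Rules: Rules{FromFile: "rules.yaml"},
		Fixtures: []Fixture{{
			Duration: "5m",
			Metrics:  []string{`http_requests{job="app-server", instance="0", group="canary"} 75 85 95`},
		}},
		Assertions: []Assertion{{
			At: "0m",
			Expected: []map[string]string{{
				"alertname":  "HTTPRequestRateLow",
				"alertstate": "pending",
			}},
		}},
	}
}

func main() {
	tc := exampleTestCase()
	fmt.Printf("%s: %d fixture(s), %d assertion(s)\n", tc.Name, len(tc.Fixtures), len(tc.Assertions))
}
```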

Rules

The rules section defines how the rules under test are loaded. Currently, two rule-loading strategies are supported:

  • fromFile - load the rules from a .rules yaml file. If the path specified is not an absolute path, the rule file path will be relative to the test file.
  • fromLiteral - embed the rules under test right inside the test file.

Example

rules:
  fromFile: http-rules.yaml

or

rules:
  fromLiteral: |-
    groups:
      - name: prometheus.rules
        rules:
          - alert: HTTPRequestRateLow
            expr: http_requests{group="canary", job="app-server"} < 100
            for: 1m
            labels:
              severity: critical

Fixtures

The fixtures section defines a list of metrics fixtures that the tests will be using. Each item in the list has the following attributes:

  • duration - How long the metrics hold each specified value. The duration must be accepted by Go's time.ParseDuration(), e.g., 5m (5 minutes), 1h (1 hour), etc.
  • metrics - The metrics and their values
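
To check whether a duration string is acceptable, you can run it through time.ParseDuration yourself. The sketch below uses a small helper, validDuration, which is named here for illustration only. Note that time.ParseDuration has no day unit, so a value such as 1d is rejected even though Prometheus's own duration syntax accepts it.

```go
package main

import (
	"fmt"
	"time"
)

// validDuration reports whether s is accepted by time.ParseDuration.
func validDuration(s string) bool {
	_, err := time.ParseDuration(s)
	return err == nil
}

func main() {
	// "1d" has an unknown unit and "5" has no unit, so both are rejected.
	for _, s := range []string{"5m", "1h", "1h30m", "90s", "1d", "5"} {
		fmt.Printf("%q valid=%t\n", s, validDuration(s))
	}
}
```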

Example

fixtures:
  - duration: 5m
    metrics:
      - http_requests{job="app-server", instance="0", group="blue"}	75
      - http_requests{job="app-server", instance="1", group="blue"}	120

This creates the two metrics, each holding its value for 5 minutes.

You can also specify multiple values for a metric:

fixtures:
  - duration: 5m
    metrics:
      - http_requests{job="app-server", instance="0", group="blue"}	75 100 200

In this case, the metric http_requests{job="app-server", instance="0", group="blue"} will be set to 75 for the first 5 minutes, 100 for the next 5 minutes and 200 for the 5 minutes after that. You can use this form to easily set up long-running time series.

Assertions

The assertions section contains a list of expectations about the alerts produced when the rules are evaluated at a given time. Each item has the following attributes:

  • at - The instant when the rules are being evaluated
  • expected - The list of expected alert properties

Example

assertions:
  - at: 0m
    expected:
      - alertname: HTTPRequestRateLow
        alertstate: pending
        job: app-server
        severity: critical
  - at: 5m
    expected:
      - alertname: HTTPRequestRateLow
        alertstate: firing
        job: app-server
        severity: critical
  - at: 10m
    expected: []

In this example, we assert that when the alert rules are evaluated at 0m with the given fixtures, the HTTPRequestRateLow alert is in the pending state; when evaluated at 5m, it is in the firing state; and when evaluated at 10m, no alert is returned.

A Complete Example

Suppose you want to test the following rule file:

groups:
  - name: prometheus.rules
    rules:
      - alert: HTTPRequestRateLow
        expr: http_requests{group="canary", job="app-server"} < 100
        for: 1m
        labels:
          severity: critical

Write a yaml file with your test cases:

name: Test HTTP Requests too low alert
rules:
  fromFile: rules.yaml
fixtures:
  - duration: 5m
    metrics:
      - http_requests{job="app-server", instance="0", group="canary", severity="overwrite-me"}	75 85  95 105 105  95  85
      - http_requests{job="app-server", instance="1", group="canary", severity="overwrite-me"}	80 90 100 110 120 130 140
assertions:
  - at: 0m
    expected:
      - alertname: HTTPRequestRateLow
        alertstate: pending
        group: canary
        instance: "0"
        job: app-server
        severity: critical
      - alertname: HTTPRequestRateLow
        alertstate: pending
        group: canary
        instance: "1"
        job: app-server
        severity: critical
    comment: |-
      At 0m, the alerts meet the threshold but have not yet met the duration requirement. Expect the alerts to be pending.
  - at: 5m
    expected:
      - alertname: HTTPRequestRateLow
        alertstate: firing
        group: canary
        instance: "0"
        job: app-server
        severity: critical
      - alertname: HTTPRequestRateLow
        alertstate: firing
        group: canary
        instance: "1"
        job: app-server
        severity: critical
    comment: |-
      At 5m, the alerts should be firing because the duration requirement is met.
  - at: 10m
    expected:
      - alertname: HTTPRequestRateLow
        alertstate: firing
        group: canary
        instance: "0"
        job: app-server
        severity: critical
    comment: |-
      At 10m, the alert should be firing only for instance 0 because instance 1 is >= 100.
  - at: 15m
    expected: []
    comment: |-
      At 15m, both instances are back to normal, therefore we expect no alert.

Run the test:

$ ./pat examples/test.yaml
=== RUN   Test_HTTP_Requests_too_low_alert_at_0m
--- PASS: Test_HTTP_Requests_too_low_alert_at_0m (0.00s)
=== RUN   Test_HTTP_Requests_too_low_alert_at_5m
--- PASS: Test_HTTP_Requests_too_low_alert_at_5m (0.00s)
PASS