[ENHANCEMENT] alternative to slo_event_producer by expression evaluation
lksv opened this issue · 0 comments
Below is an example of `slo_rules.yaml` with new semantics.
The example consists of two parts:
- thresholds for each class and category
- rules with expressions. For the expressions, https://github.com/antonmedv/expr seems to me like a great choice.
The first part should also be exported as Prometheus metrics, in the same (compatible) format as the SLO metadata. This could simplify the configuration of slo-exporter.
Note that the term `category` is used for `availability`, `latency`, etc. `slo_type`, on the other hand, must explicitly identify a particular metric/SLI/SLO. Therefore, when more than one SLI is used for a category, `slo_type` identifies the exact one:
- category: "latency", slo_type: "latency99", percentile: "99"
- category: "latency", slo_type: "latency90", percentile: "90"
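The relation between `category` and `slo_type` can be sketched in Python, following the expansion rules stated in the comments of the example configuration. This is only an illustration under assumptions: the function name `expand_class` is hypothetical, and the naming rule `<category><threshold>` is inferred from the `latency99` / `latency90` example rather than specified anywhere.

```python
def expand_class(raw):
    """Expand the abbreviated class notation into one dict per slo_type.

    Hypothetical helper: it mirrors the three cases described in the
    example configuration (number, array of dicts, dict).
    """
    expanded = {}
    for key, value in raw.items():
        if isinstance(value, (int, float)):
            # A bare number is shorthand for a threshold.
            expanded[key] = {"threshold": value, "slo_category": key, "slo_type": key}
        elif isinstance(value, list):
            # An array of dicts yields one slo_type per entry.
            # Assumption: the slo_type name is "<category><threshold>".
            for item in value:
                slo_type = f"{key}{item['threshold']:g}"
                expanded[slo_type] = {**item, "slo_category": key, "slo_type": slo_type}
        elif isinstance(value, dict):
            # A dict is kept as is; missing keys default to the key name.
            entry = dict(value)
            entry.setdefault("slo_category", key)
            entry.setdefault("slo_type", key)
            expanded[key] = entry
        else:
            raise TypeError(f"unsupported value for {key!r}: {value!r}")
    return expanded


# The "critical" class from version "1" of the example table.
critical = expand_class({
    "availability": 99.9,
    "latency": [
        {"threshold": 99, "maxDuration": 0.5},
        {"threshold": 90, "maxDuration": 0.2},
    ],
})
```

After expansion, `critical` has the keys `availability`, `latency99` and `latency90`, and both latency entries share `slo_category` set to `latency`.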
The following example does not describe a useful SLO definition; it is intended as a showcase of possible usage.
classes:
  - version: "1"
    # Keys are SLO classes, and under each key is a dictionary whose keys define
    # slo_types (availability, latency90, latency99, etc.).
    #
    # If a value of the dict is:
    # * a number, then it is interpreted as `threshold => <number>`, e.g.
    #   `{ "availability" => 99.9 }` is only an abbreviated notation for
    #   ```
    #   {
    #     "availability" => { "threshold" => 99.9, "slo_category" => "availability", "slo_type" => "availability" }
    #   }
    #   ```
    # * an array of dicts, then it is expanded as the example shows:
    #   from `"latency" => [{ "threshold" => 99, "maxDuration" => 0.5 }, { "threshold" => 90, "maxDuration" => 0.2 }]` to
    #   ```
    #   {
    #     "latency99" => { "threshold" => 99, "maxDuration" => 0.5, "slo_category" => "latency", "slo_type" => "latency99" },
    #     "latency90" => { "threshold" => 90, "maxDuration" => 0.2, "slo_category" => "latency", "slo_type" => "latency90" }
    #   }
    #   ```
    # * a dict:
    #   If the keys `slo_category` or `slo_type` are not present, they are set
    #   to the same value as the key pointing to the dict. This dict is then
    #   accessible from rule expressions, and its `threshold` value is passed
    #   to Prometheus to be used as the SLO threshold.
    #
    # The first version might implement only the dict form.
    #
    # The following lines are intentionally long, without line breaks.
    # This makes it visually straightforward to compare the individual
    # SLO classes and categories (slo_class & slo_types) with each other.
    #
    critical: { "availability" => 99.9, "latency" => [{ "threshold" => 99, "maxDuration" => 0.5 }, { "threshold" => 90, "maxDuration" => 0.2 }] }
    high_fast: { "availability" => 99.5, "latency" => [{ "threshold" => 99, "maxDuration" => 1.5 }, { "threshold" => 90, "maxDuration" => 0.5 }] }
    high_slow: { "availability" => 99.5, "latency" => [{ "threshold" => 99, "maxDuration" => 3.0 }, { "threshold" => 90, "maxDuration" => 2.0 }] }
    low: { "availability" => 99.0, "latency" => [{ "threshold" => 99, "maxDuration" => 6.0 }, { "threshold" => 90, "maxDuration" => 3.0 }] }
  - version: "2"
    critical: { "availability" => { "threshold" => 99.9, "maxDuration" => 0.2 } }
    high_fast: { "availability" => { "threshold" => 99.0, "maxDuration" => 1.5 } }
    high_slow: { "availability" => { "threshold" => 99.0, "maxDuration" => 3.0 } }
    low: { "availability" => { "threshold" => 95.0, "maxDuration" => 6.0 } }
# Evaluation workflow:
# 1. The class of the input event is determined first (e.g. `slo_class=critical`).
# 2. For each version in the class table:
#    1. Only rule groups whose `group_expr` evaluates to true are evaluated.
#    2. When rules are evaluated, all variables from the `classes` definition table are accessible.
#    3. When `additional_metadata` is defined, then:
#       * all values which are strings are added to the slo_event
#       * all values which are dicts with only an `expr` key are evaluated and the result is added to the slo_event
#       * otherwise an error metric is incremented.
slo_domain: 'autoadmins'
rule_groups:
  - group_expr: 'version == "1"'
    rules:
      - slo_type: 'availability'
        slo_result_expr: "statusCode < 500"
      - slo_type: 'latency90'
        slo_result_expr: "requestDuration < class.latency90.maxDuration"
        additional_metadata:
          percentile: 90
          le: 0.2 # hardcoded, same number as `class.latency90.maxDuration` of the critical class
      - slo_type: 'latency99'
        slo_result_expr: "requestDuration < class.latency99.maxDuration"
        additional_metadata:
          percentile: 99
          le:
            - expr: 'class.latency99.maxDuration'
  - group_expr: 'version == "2"'
    rules:
      - slo_type: 'availability&latency'
        slo_result_expr: "statusCode < 500 && requestDuration < availability.maxDuration"
        additional_metadata:
          percentile: 100
          le:
            - expr: 'class.availability.maxDuration'
      # Example of one category (slo_type) instead of three.
      - slo_type: 'availability&latency'
        default_expr: "statusCode < 500 && requestDuration < availability.maxDuration"
      # Example of expression-defined additional metadata:
      # the result of the expression is used as the slo_event.
      # The `slo_type` key is added to the expression result, and the result is checked to contain `slo_result` as a boolean.
      - slo_type: 'availability&latency'
        slo_event_expr: "{ le: availability.maxDuration, percentile: availability, slo_result: (statusCode < 500 && requestDuration < availability.maxDuration) }"
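The evaluation workflow described in the comments above could look roughly like the following Python sketch. It rests on several assumptions: expressions are modeled as plain Python callables that receive an environment dict (a stand-in for expressions compiled by an engine such as antonmedv/expr, which is a Go library), `version` is taken from the class table entry, the class table is already in its expanded dict form, and the names `evaluate_event`, `env` and `metadata_errors` are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Env = Dict[str, Any]
Expr = Callable[[Env], Any]  # stand-in for a compiled expression


def evaluate_event(
    event: Env, classes: List[Dict[str, Any]], rule_groups: List[Dict[str, Any]]
) -> Tuple[List[Dict[str, Any]], int]:
    """Return the produced slo_events and the number of metadata errors."""
    slo_events: List[Dict[str, Any]] = []
    metadata_errors = 0  # would be a Prometheus counter in a real exporter
    for table in classes:  # step 2: each version of the class table
        # Step 1: look up the thresholds of the event's class.
        slo_class = table[event["slo_class"]]
        env = {**event, "version": table["version"], "class": slo_class}
        for group in rule_groups:
            # Step 2.1: skip groups whose group_expr is not true.
            if not group["group_expr"](env):
                continue
            for rule in group["rules"]:
                # Step 2.2: rule expressions see the class variables via env.
                slo_event = {
                    "slo_type": rule["slo_type"],
                    "slo_result": bool(rule["slo_result_expr"](env)),
                }
                # Step 2.3: strings are copied, {"expr": ...} is evaluated,
                # anything else counts as an error.
                for key, value in rule.get("additional_metadata", {}).items():
                    if isinstance(value, str):
                        slo_event[key] = value
                    elif isinstance(value, dict) and set(value) == {"expr"}:
                        slo_event[key] = value["expr"](env)
                    else:
                        metadata_errors += 1
                slo_events.append(slo_event)
    return slo_events, metadata_errors


classes = [{
    "version": "1",
    "critical": {
        "availability": {"threshold": 99.9},
        "latency99": {"threshold": 99, "maxDuration": 0.5},
    },
}]
rule_groups = [{
    "group_expr": lambda env: env["version"] == "1",
    "rules": [
        {"slo_type": "availability",
         "slo_result_expr": lambda env: env["statusCode"] < 500},
        {"slo_type": "latency99",
         "slo_result_expr": lambda env: env["requestDuration"] < env["class"]["latency99"]["maxDuration"],
         "additional_metadata": {
             "percentile": "99",
             "le": {"expr": lambda env: env["class"]["latency99"]["maxDuration"]},
         }},
    ],
}]
event = {"slo_class": "critical", "statusCode": 200, "requestDuration": 0.3}
slo_events, metadata_errors = evaluate_event(event, classes, rule_groups)
```

Callables keep the sketch self-contained; a real implementation would compile the expression strings once at configuration load time and run them against the same kind of environment.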