seznam/slo-exporter

[ENHANCEMENT] alternative to slo_event_producer by expression evaluation

lksv opened this issue · 0 comments

Below is an example of slo_rules.yaml with new semantics.

Example consists of two parts:

  1. thresholds for each class and category
  2. rules with expressions. For the expressions, it seems to me that https://github.com/antonmedv/expr would be a great choice.

The first part should also be exported as Prometheus metrics, in the same (compatible) format as the SLO metadata. This could simplify the configuration of slo-exporter.

Note that the term category is used for availability, latency, etc. slo_type, on the other hand, must explicitly identify a particular metric/SLI/SLO. Therefore, when more than one SLI is used for a category, slo_type identifies the exact one:

  • category: "latency", slo_type: "latency99", percentile: "99"
  • category: "latency", slo_type: "latency90", percentile: "90"
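
As a minimal sketch of how distinct slo_type values could be derived for several SLIs sharing one category, the Python snippet below builds the name from the category plus the percentile threshold. The helper name `expand_category` and the naming scheme are assumptions inferred from the `latency99`/`latency90` examples, not part of slo-exporter.

```python
def expand_category(category, entries):
    """Expand SLI definitions that share one category into named slo_types.

    Hypothetical helper: the naming scheme (category + threshold) is an
    assumption based on the latency99/latency90 examples.
    """
    expanded = {}
    for entry in entries:
        slo_type = f"{category}{entry['threshold']}"
        # Keep the original settings and attach the identifying keys.
        expanded[slo_type] = {**entry, "slo_category": category, "slo_type": slo_type}
    return expanded


latency = expand_category(
    "latency",
    [{"threshold": 99, "maxDuration": 0.5}, {"threshold": 90, "maxDuration": 0.2}],
)
```

Here `latency` ends up with the keys `latency99` and `latency90`, both carrying `slo_category` set to `latency`.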

The following example does not describe a useful SLO definition; it is intended as a showcase of possible usage.

classes:    
  - version: "1"    
    # Keys are SLO classes, and under each key is a dictionary whose keys define
    # slo_types (availability, latency90, latency99, etc.).
    #
    # If a value of the dict is:
    # * a number, then it is interpreted as `threshold => <number>`, e.g.
    #   `{ "availability" => 99.9 }` is just an abbreviated notation for
    #   ```
    #   {
    #     "availability" => { "threshold" => 99.9, "slo_category" => "availability", "slo_type" => "availability" }
    #   }
    #   ```
    # * an array of dicts, then it is expanded as the example shows:
    #   From `"latency" => [{ "threshold" => 99, "maxDuration" => 0.5 }, { "threshold" => 90, "maxDuration" => 0.2 }]` to
    #   ```    
    #   {
    #     "latency99" => { "threshold" => 99, "maxDuration" => 0.5, "slo_category" => "latency", "slo_type" => "latency99" },
    #     "latency90" => { "threshold" => 90, "maxDuration" => 0.2, "slo_category" => "latency", "slo_type" => "latency90" }
    #   }    
    #   ```    
    # * a dict:    
    #   If the keys `slo_category` or `slo_type` are not present, they are set
    #   to the same value as the key pointing to the dict. The dict is then
    #   accessible from rule expressions, and its `threshold` value is passed
    #   to Prometheus to be used as the SLO threshold.
    #
    # The first version might implement only the dict form.
    #
    # The following lines are intentionally long, without line breaks.
    # This makes it visually straightforward to compare the individual
    # SLO classes and categories (slo_class & slo_types) with each other.
    #    
    critical:  { "availability" => 99.9, "latency" => [{ "threshold" => 99, "maxDuration" => 0.5}, { "threshold" => 90, "maxDuration" => 0.2 }] }
    high_fast: { "availability" => 99.5, "latency" => [{ "threshold" => 99, "maxDuration" => 1.5}, { "threshold" => 90, "maxDuration" => 0.5 }] }
    high_slow: { "availability" => 99.5, "latency" => [{ "threshold" => 99, "maxDuration" => 3.0}, { "threshold" => 90, "maxDuration" => 2.0 }] }
    low:       { "availability" => 99.0, "latency" => [{ "threshold" => 99, "maxDuration" => 6.0}, { "threshold" => 90, "maxDuration" => 3.0 }] }
  - version: "2"    
    critical:  { "availability" => { "threshold" => 99.9, "maxDuration" => 0.2 } } 
    high_fast: { "availability" => { "threshold" => 99.0, "maxDuration" => 1.5 } } 
    high_slow: { "availability" => { "threshold" => 99.0, "maxDuration" => 3.0 } } 
    low:       { "availability" => { "threshold" => 95.0, "maxDuration" => 6.0 } } 
    
    
# Evaluation workflow:
# 1. The class of the input event is determined first (e.g. `slo_class=critical`).
# 2. For each version in the class table:
#    1. Only rule groups whose group_expr evaluates to true are evaluated.
#    2. When rules are evaluated, all variables from the `classes` definition table are accessible.
#    3. When additional_metadata is defined, then:
#       * all values which are strings are added to the slo_event
#       * all values which are dicts with only an `expr` key are evaluated and the result is added to the slo_event
#       * otherwise an error metric is incremented.
    
slo_domain: 'autoadmins'    
rule_groups:
    - group_expr: 'version == "1"'    
      rules:    
      - slo_type: 'availability'    
        slo_result_expr: "statusCode < 500"
      - slo_type: 'latency90'    
        slo_result_expr: "requestDuration < class.latency90.maxDuration"
        additional_metadata:
          percentile: 90
          le: 0.2  # hardcoded, same number as `class.latency90.maxDuration`
      - slo_type: 'latency99'
        slo_result_expr: "requestDuration < class.latency99.maxDuration"
        additional_metadata:
          percentile: 99
          le:
            expr: 'class.latency99.maxDuration'
    
    - group_expr: 'version == "2"'    
      rules:    
      - slo_type: 'availability&latency'    
        slo_result_expr: "statusCode < 500 && requestDuration < availability.maxDuration"
        additional_metadata:    
          percentile: 100    
          le:
            expr: 'class.availability.maxDuration'
      # example of one category (slo_type) instead of three    
      - slo_type: 'availability&latency'    
        default_expr: "statusCode < 500 && requestDuration < availability.maxDuration"
    
      # Example of additional metadata defined by an expression.
      # The result of the expression is used as the slo_event.
      # The `slo_type` key is added to the expression result, and the result is checked to contain `slo_result` as a boolean.
      - slo_type: 'availability&latency'    
        slo_event_expr: "{ le: availability.maxDuration, percentile: availability, slo_result: (statusCode < 500 && requestDuration < availability.maxDuration) }"
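
The evaluation workflow described in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain Python callables stand in for the proposed expression strings (no expression library is used), the class table is already in the normalized dict form, and the names `CLASSES`, `RULE_GROUPS`, `evaluate` and `error_count` are hypothetical.

```python
# Normalized class table (dict form only): version -> slo_class -> slo_type -> settings.
CLASSES = {
    "1": {"critical": {
        "availability": {"threshold": 99.9},
        "latency90": {"threshold": 90, "maxDuration": 0.2},
        "latency99": {"threshold": 99, "maxDuration": 0.5},
    }},
    "2": {"critical": {"availability": {"threshold": 99.9, "maxDuration": 0.2}}},
}

# Assumption: callables replace the expression strings of the proposal.
RULE_GROUPS = [
    {
        "group_expr": lambda env: env["version"] == "1",
        "rules": [
            {"slo_type": "availability",
             "slo_result_expr": lambda env: env["statusCode"] < 500},
            {"slo_type": "latency90",
             "slo_result_expr": lambda env: (
                 env["requestDuration"] < env["class"]["latency90"]["maxDuration"]),
             "additional_metadata": {
                 "percentile": "90",
                 "le": {"expr": lambda env: env["class"]["latency90"]["maxDuration"]},
             }},
        ],
    },
    {
        "group_expr": lambda env: env["version"] == "2",
        "rules": [
            {"slo_type": "availability&latency",
             "slo_result_expr": lambda env: (
                 env["statusCode"] < 500
                 and env["requestDuration"] < env["class"]["availability"]["maxDuration"])},
        ],
    },
]

error_count = 0  # stand-in for the error metric mentioned in the workflow


def evaluate(event):
    """Return the slo_events produced for one input event."""
    global error_count
    slo_events = []
    for version, table in CLASSES.items():
        slo_class = table.get(event["slo_class"])
        if slo_class is None:
            continue  # class not defined for this version
        env = {**event, "version": version, "class": slo_class}
        for group in RULE_GROUPS:
            if not group["group_expr"](env):
                continue
            for rule in group["rules"]:
                slo_event = {"slo_type": rule["slo_type"],
                             "slo_result": bool(rule["slo_result_expr"](env))}
                for key, value in rule.get("additional_metadata", {}).items():
                    if isinstance(value, str):
                        slo_event[key] = value  # strings are copied as-is
                    elif isinstance(value, dict) and set(value) == {"expr"}:
                        slo_event[key] = value["expr"](env)  # evaluated metadata
                    else:
                        error_count += 1  # unsupported metadata value
                slo_events.append(slo_event)
    return slo_events


events = evaluate({"slo_class": "critical", "statusCode": 200, "requestDuration": 0.3})
```

Note that under the stated workflow only string metadata values are copied directly, so the sketch writes the percentile as the string `"90"`; a bare number would fall into the error branch.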