simple-circuit-breaker-for-azure-tm

Primary language: Go. License: MIT.

Simple circuit breaker for Azure Traffic Manager

Background

Azure Traffic Manager is useful for site failover. However, it may fail back before the site has fully recovered, because a probe can treat the site as recovered after a single successful check. Ideally, the health endpoint that the probe checks confirms full recovery, but it may not cover complex failures. Failing back on such a false recovery signal has adverse effects such as flapping.

Solution

  • Create a simple circuit breaker that disables endpoints that are down
  • It is "simple" in that it does not automatically return the breaker to the closed state; in other words, it does not re-enable the endpoint. An operator is expected to judge whether failback is possible and to fail back manually
  • Implement the breaker on Azure Functions (Go)
    • Sample code is in this repository
    • Tested on Linux Consumption plan
  • Azure Monitor checks the Azure Traffic Manager endpoint status metric and alerts the breaker when any endpoint is not online

Overview

flowchart LR
    subgraph TM[Azure Traffic Manager]
        probe[Probe]
        method[Routing Method: Priority]
    end
    probe -- probe fail --x ep1[Endpoint1: high priority]
    probe --> ep2[Endpoint2: low priority]
    AM[Azure Monitor] -- watch metric --> TM
    AM -- alert action --> breaker
    subgraph AF[Azure Functions]
        breaker[breaker function]
    end
    breaker -- disable endpoint on routing targets --> AAPI[Azure Resource Manager API]

Conditions for disabling endpoints

The conditions are somewhat conservative, given the risks.

  • The Azure Traffic Manager routing method is Priority
  • The alert condition is Fired
  • The profile has multiple endpoints
  • At least one endpoint is online
  • Endpoints are sorted by priority, and those that are not online and not already disabled are disabled. Processing stops when an online endpoint is found.
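
The conditions above can be sketched in Go roughly as follows. This is a minimal illustration, not the code in this repository: the `Endpoint` type and the `endpointsToDisable` function are hypothetical, the status strings ("Priority", "Fired", "Online") are assumed to match what Azure reports, and a lower priority number is assumed to mean higher priority. The sketch only selects the endpoints to disable; the call to the Azure Resource Manager API is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint is a hypothetical, simplified view of a Traffic Manager endpoint.
type Endpoint struct {
	Name          string
	Priority      int64  // assumption: lower value means higher priority
	Enabled       bool   // false if the endpoint is already disabled
	MonitorStatus string // assumption: "Online" marks a healthy endpoint
}

// isOnline reports whether an endpoint is enabled and monitored as online.
func isOnline(ep Endpoint) bool {
	return ep.Enabled && ep.MonitorStatus == "Online"
}

// endpointsToDisable returns the names of the endpoints the breaker would
// disable, or nil when any precondition is not met.
func endpointsToDisable(routingMethod, alertCondition string, eps []Endpoint) []string {
	// Preconditions: Priority routing, fired alert, multiple endpoints.
	if routingMethod != "Priority" || alertCondition != "Fired" || len(eps) < 2 {
		return nil
	}

	// Precondition: at least one endpoint must still be online.
	hasOnline := false
	for _, ep := range eps {
		if isOnline(ep) {
			hasOnline = true
			break
		}
	}
	if !hasOnline {
		return nil
	}

	// Sort a copy by priority so the caller's slice is left untouched.
	sorted := append([]Endpoint(nil), eps...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority < sorted[j].Priority
	})

	var targets []string
	for _, ep := range sorted {
		if isOnline(ep) {
			break // stop at the first online endpoint
		}
		if ep.Enabled {
			targets = append(targets, ep.Name) // not online and not yet disabled
		}
	}
	return targets
}

func main() {
	eps := []Endpoint{
		{Name: "ep1", Priority: 1, Enabled: true, MonitorStatus: "Degraded"},
		{Name: "ep2", Priority: 2, Enabled: true, MonitorStatus: "Online"},
	}
	fmt.Println(endpointsToDisable("Priority", "Fired", eps))
}
```

Because the loop stops at the first online endpoint, lower-priority endpoints that are also unhealthy are left as they are, which keeps the breaker from disabling more than is needed to prevent failback.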

Alert rule example

  • "criterionType": "StaticThresholdCriterion"
  • "threshold": 0
  • "operator": "LessThanOrEqual"
  • "timeAggregation": "Minimum"
  • "evaluationFrequency": "PT1M"
  • "windowSize": "PT1M"

Room for improvement