Azure/azure-monitor-baseline-alerts

[Question/Feedback]: Strategy for avoiding configuration drift?

Closed this issue · 4 comments

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Description

Hi all,

I have a challenge with configuration drift and wonder if you guys have any suggestions or thoughts regarding this topic.

Background

Basically, Azure Policy deploys the alerts through ARM deployments, right? To my understanding, this ARM deployment is a "deploy and forget" deployment.

Policy Compliance, more or less, only cares if the alert resource itself is deployed, not the content of the alert. If the alert is not yet deployed, a remediation task can be executed to fix the alert.

This means that if I edit a deployed alert in any way, for example its threshold, Azure Policy will not detect the change and revert it to the original value, which is usually what you want when working with IaC (and, to some degree, with Azure Policy).

This can cause some issues:

  1. After AMBA is deployed, changes made in the codebase to the alerts are never deployed to the alert resources. The Policy Definition is naturally updated to reflect the changes, but as the policy assignment is compliant it never deploys the update to the alert resource.

    • Result: Configuration Drift and loss of control for IT Administrators
  2. If going with the decentralized approach where alerts are deployed to the landing zones themselves, end users can modify the alerts with undesired values causing alarm-storms or even non-working alerts.

    • Result: Configuration drift. Changes to the alerts go unnoticed and can potentially cause a breach of SLA with the customer because an alert didn't fire when it should have. (This would likewise be a problem in the centralized approach if an IT admin changes the alert instead.)

Potential solutions:

  • Is it possible to modify policy definitions to also look for content such as thresholds? Not sure if the existenceCondition field can be this complex
  • Somehow force re-deployments on every pipeline run. Not sure if this is possible with the way Azure Policy works.
  • Make a full swap to Terraform/Bicep, where the alerts are fully managed instead of being deployed with Policy. At least with Terraform, this would require a continuous pipeline that scans for subscriptions (data source) and deploys alerts for each subscription.
    • This obviously requires a complete rewrite; it's more of a solution to my specific case.

@NikolaiKleppe thank you for your feedback. We are actively looking at multiple scenarios/solutions to determine how this could best be addressed. I will leave this issue open and will tag you when there are updates.

@NikolaiKleppe - I have not tested this with AMBA yet (I'm planning to test soon and can report back once I have had a chance), but for your second point, that end users can modify alerts, I'm wondering if using Deployment Stacks could help.

My thinking is that AMBA could change the pipeline deployment line to something along the lines of:

az stack mg create --name '<deployment-stack-name>' --location '<location>' --template-file '[<bicep-file-name>](https://raw.githubusercontent.com/azure/azure-monitor-baseline-alerts/main/patterns/alz/alzArm.json)' --parameters 'alzArm.param.json' --deny-settings-mode 'denyWriteAndDelete'

The final option, --deny-settings-mode 'denyWriteAndDelete', is what should stop people from updating the alerts. I am not sure this will work, though, given that, as you have highlighted, the alerts are deployed via Azure Policy rather than directly, so these controls might not be inherited.

@arjenhuitema Hi again, I watched the External Community Call - December 2023 where you guys brought up the configuration drift issue in AMBA. It seems extending the existenceCondition is the way to go?

Do you guys have any more updates regarding this?

I tested adding a threshold condition to the existenceCondition of a metric alert policy, and it seems to work fine:

{
  "field": "Microsoft.Insights/metricAlerts/criteria.Microsoft-Azure-Monitor-SingleResourceMultipleMetricCriteria.allOf[*].threshold",
  "equals": "[parameters('threshold')]"
}

image

After remediation:

image

I suppose this can be extended to quite a lot of properties, looking at the aliases available.
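For context, here is a minimal sketch of where such a condition might sit inside a deployIfNotExists policy's then.details block. It is not taken from the AMBA codebase: the `enabled` check and both parameter names are assumptions for illustration, and the `roleDefinitionIds` and `deployment` properties that a real definition needs are left out. JSON cannot carry comments, so the assumptions are stated here rather than inline.

```json
{
  "type": "Microsoft.Insights/metricAlerts",
  "existenceCondition": {
    "allOf": [
      {
        "field": "Microsoft.Insights/metricAlerts/enabled",
        "equals": "[parameters('enabled')]"
      },
      {
        "field": "Microsoft.Insights/metricAlerts/criteria.Microsoft-Azure-Monitor-SingleResourceMultipleMetricCriteria.allOf[*].threshold",
        "equals": "[parameters('threshold')]"
      }
    ]
  }
}
```

Because the threshold alias ends in `[*]`, the condition holds only if every criterion in the alert matches the parameter. An alert whose threshold was edited by hand then fails the existenceCondition, is reported as non-compliant, and can be redeployed by a remediation task.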

Hi @NikolaiKleppe, yes, existenceCondition is the way we are implementing this. I've just linked a PR that adds the following parameters to the existenceCondition:

  1. evaluationFrequency
  2. windowSize
  3. threshold
  4. severity
  5. operator
  6. timeAggregation
  7. autoMitigate
  8. alertSensitivity (for dynamic thresholds)
  9. numberOfEvaluationPeriods (for dynamic thresholds)
  10. minFailingPeriodsToAlert (for dynamic thresholds)
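
As a rough illustration of what checking several of these properties could look like, the sketch below extends an existenceCondition with a few of them. It is not copied from the linked PR: the alias names are assumptions patterned on the threshold alias tested earlier in this thread, and the parameter names are hypothetical, so both should be verified against the aliases published for Microsoft.Insights/metricAlerts before use. The dynamic-threshold properties are left out because their aliases are not shown in this thread.

```json
{
  "allOf": [
    {
      "field": "Microsoft.Insights/metricAlerts/evaluationFrequency",
      "equals": "[parameters('evaluationFrequency')]"
    },
    {
      "field": "Microsoft.Insights/metricAlerts/windowSize",
      "equals": "[parameters('windowSize')]"
    },
    {
      "field": "Microsoft.Insights/metricAlerts/severity",
      "equals": "[parameters('severity')]"
    },
    {
      "field": "Microsoft.Insights/metricAlerts/criteria.Microsoft-Azure-Monitor-SingleResourceMultipleMetricCriteria.allOf[*].operator",
      "equals": "[parameters('operator')]"
    }
  ]
}
```

Each property added this way narrows what counts as compliant, so a change made to the parameter values in the codebase would also surface existing alerts as non-compliant until they are remediated.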