A company wants to use Azure Monitor and alerts for the following events:
- Alert when a VM has high CPU utilization. Over 90% for 5 minutes. Filtered by Subscription and Resource Group.
- Alert when a VM has low memory. Less than 200MB available for 5 minutes. Filtered by Subscription and Resource Group.
- Alert when a VM has low disk space. Less than 10% available. Filtered by Subscription and Resource Group.
- Alert when a VM is down, the agent is not reporting, or is generally unhealthy for a period of 5 minutes. Filtered by Subscription and Resource Group.
The maintenance window is specified in VM tag 'maintenance' with the format "zz-xddd-hhmm-HHMM-", where:

| Field | Meaning |
|-------|---------|
| zz | Not used for alerting purposes |
| x | n-th occurrence of the weekday in the month on which maintenance is performed every month. For the 3rd Wednesday, x=3 |
| ddd | First 3 letters of the weekday when maintenance is performed |
| hh | Hour of the day when maintenance starts |
| mm | Minute of the hour when maintenance starts |
| HH | Hour of the day when maintenance ends |
| MM | Minute of the hour when maintenance ends |
| Z | Not used for alerting purposes |

ALL TIMES ARE IN UTC.
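
To make the tag layout concrete, here is a minimal Python sketch of a parser for it. This is not the code in maintenance.psm1; the names `MaintenanceWindow` and `parse_maintenance_tag` and the sample tag value are hypothetical, and the sketch assumes the weekday is written with English three-letter abbreviations.

```python
import re
from typing import NamedTuple

WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


class MaintenanceWindow(NamedTuple):
    nth: int           # x: 1 = first occurrence in the month, 2 = second, ...
    weekday: int       # ddd mapped to 0 = Monday ... 6 = Sunday
    start_minute: int  # hhmm as minutes after 00:00 UTC
    end_minute: int    # HHMM as minutes after 00:00 UTC


# zz, then x + ddd, then hhmm, then HHMM; anything after the last "-" is ignored.
TAG_RE = re.compile(r"^[^-]*-(\d)([A-Za-z]{3})-(\d{2})(\d{2})-(\d{2})(\d{2})-")


def parse_maintenance_tag(tag: str) -> MaintenanceWindow:
    """Parse a 'zz-xddd-hhmm-HHMM-' style tag value into its alerting fields."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized maintenance tag: {tag!r}")
    nth, ddd, hh, mm, end_hh, end_mm = match.groups()
    if ddd.lower() not in WEEKDAYS or not 1 <= int(nth) <= 5:
        raise ValueError(f"invalid weekday or occurrence in tag: {tag!r}")
    if int(hh) > 23 or int(end_hh) > 23 or int(mm) > 59 or int(end_mm) > 59:
        raise ValueError(f"invalid time in tag: {tag!r}")
    return MaintenanceWindow(
        nth=int(nth),
        weekday=WEEKDAYS.index(ddd.lower()),
        start_minute=int(hh) * 60 + int(mm),
        end_minute=int(end_hh) * 60 + int(end_mm),
    )


# Hypothetical tag: 3rd Wednesday of the month, 01:00 to 04:00 UTC.
window = parse_maintenance_tag("XX-3wed-0100-0400-Z")
```

Storing the times as minutes after midnight keeps later comparisons simple, but that is a design choice of this sketch, not something the tag format requires.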
In addition to scheduled maintenance, it is desirable to be able to put a VM under a maintenance alert exemption on demand. This way any VM can be temporarily removed from the alerts at any time, for an indefinite period (not implemented in the solution yet).
- Azure Automation Runbook scheduled to run every hour to:
  - Query the VM tag "maintenance" and determine which VMs will enter maintenance in the next hour. The runbook runs 10 minutes before each hour (0:50, 1:50, 2:50, ...) and evaluates the schedule 10 minutes ahead. This way it picks up the right VMs for the upcoming maintenance window and leaves 10 minutes for Azure Automation job queuing, job completion, and Log Analytics record ingestion. As a result, alerts stop a few minutes before the actual maintenance start time.
  - Send the list of VMs to a Custom Table in Log Analytics. The table is named MaintenanceVM_CL.
- Kusto queries for alerts that join the MaintenanceVM_CL custom table on Computer, taking into account the maintenance records written in the last hour so those VMs can be excluded from the query results.
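
The runbook timing can be sketched in Python as follows. This is one interpretation of the description above, not the runbook's actual code: a run at hh:50 looks 10 minutes ahead and selects windows that start within the following hour. The function names are hypothetical, and the inputs (n-th occurrence, weekday with 0 = Monday, start time in minutes after midnight UTC) are assumed to have been read from the tag already.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Optional


def nth_weekday_of_month(year: int, month: int, weekday: int, n: int) -> Optional[date]:
    """Date of the n-th given weekday (0 = Monday) in a month, or None if there is none."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7  # days from the 1st to the first such weekday
    day = 1 + offset + 7 * (n - 1)
    try:
        return date(year, month, day)
    except ValueError:  # e.g. a 5th Friday that does not exist in this month
        return None


def starts_in_next_hour(run_time: datetime, nth: int, weekday: int, start_minute: int,
                        lookahead: timedelta = timedelta(minutes=10)) -> bool:
    """True if the maintenance window starts in the hour beginning at run_time + lookahead."""
    window_open = run_time + lookahead
    window_close = window_open + timedelta(hours=1)
    # Check the month at both ends of the hour in case it straddles a month boundary.
    for ref in (window_open, window_close):
        day = nth_weekday_of_month(ref.year, ref.month, weekday, nth)
        if day is None:
            continue
        start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        start += timedelta(minutes=start_minute)
        if window_open <= start < window_close:
            return True
    return False


# A run at 00:50 UTC on the 3rd Wednesday of June 2024 picks up a 01:00 start.
run = datetime(2024, 6, 19, 0, 50, tzinfo=timezone.utc)
selected = starts_in_next_hour(run, nth=3, weekday=2, start_minute=60)
```

All datetimes are timezone-aware UTC values, in line with the UTC-only requirement of the tag.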

| Component | Description |
|-----------|-------------|
| maintenance.psm1 | PowerShell module to be uploaded to your Azure Automation Account. This code is used by the Automation runbooks. |
| maintschedule.ps1 | Automation runbook to be uploaded and scheduled every hour at the 50-minute offset. Workspace Id and Shared Key need to be updated with your Log Analytics info. |
| maintondemand.ps1 | Automation runbook to be uploaded and triggered through a Webhook with VM Name, StartDtm and EndDtm. Workspace Id and Shared Key need to be updated with your Log Analytics info. |
| Log Analytics Alerts | The code for the queries is provided below in the section Alert Implementation with Kusto Queries. The subscription id and resource group need to be updated with your values. |

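
The Workspace Id and Shared Key mentioned above suggest that the runbooks write to the custom table through the Azure Monitor HTTP Data Collector API; that is an assumption, since the runbook code is not shown here. Under that assumption, the following Python sketch builds (but does not send) a signed request. The function name, the record field names, and the `MaintenanceType` value are hypothetical; the field names are inferred from the columns the alert queries read, given that the API appends type suffixes such as `_s` and `_t` to custom fields and `_CL` to the log type.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timezone
from email.utils import format_datetime


def build_log_request(workspace_id, shared_key, records, log_type="MaintenanceVM", now=None):
    """Return (url, headers, body) for a Data Collector API POST; nothing is sent."""
    body = json.dumps(records).encode("utf-8")
    now = now or datetime.now(timezone.utc)
    rfc1123_date = format_datetime(now, usegmt=True)  # e.g. "Wed, 19 Jun 2024 00:50:00 GMT"

    # The signature is an HMAC-SHA256 over this string, keyed with the decoded shared key.
    string_to_sign = (
        f"POST\n{len(body)}\napplication/json\nx-ms-date:{rfc1123_date}\n/api/logs"
    )
    digest = hmac.new(
        base64.b64decode(shared_key), string_to_sign.encode("utf-8"), hashlib.sha256
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"SharedKey {workspace_id}:{signature}",
        "Log-Type": log_type,  # the service stores this as <log_type>_CL
        "x-ms-date": rfc1123_date,
    }
    url = f"https://{workspace_id}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01"
    return url, headers, body


# Placeholder credentials and a hypothetical record; replace with real values before use.
demo_key = base64.b64encode(b"not-a-real-key").decode("ascii")
demo_records = [{
    "Computer": "vm01",
    "MaintenanceType": "Scheduled",       # hypothetical value
    "StartTime": "2024-06-19T01:00:00Z",
    "EndTime": "2024-06-19T04:00:00Z",
}]
url, headers, body = build_log_request(
    "ws-id", demo_key, demo_records,
    now=datetime(2024, 6, 19, 0, 50, tzinfo=timezone.utc),
)
```

The returned tuple can be passed to any HTTP client. Keeping the request construction separate from the network call makes the signing step easy to check in isolation.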
- The solution works as long as every VM in the organization has a different name. Good naming standards are a best practice. Any collision between VM names would adversely affect this implementation. A potential solution to this issue is to use resourceId instead of VM Name.
- The 'maintenance' tag requires the use of UTC. This is for code simplification and the fact that Log Analytics uses UTC times. Azure runs in UTC so your Azure operations should too.
- A Log Analytics Workspace.
- All VMs that will be monitored have to be enrolled in the Workspace.
Go to Azure Monitor > Alerts
For each Alert:
- Click on New Alert Rule
- Click on Resource "Select" Button.
- Filter by resource type "Log Analytics Workspaces" and select the Workspace.
- Click on the Condition "Add" button and select "Custom log search" as the Signal Name.
- Configure the Kusto query as described below for each case.
- Configure Actions
- Click on Create Alert Rule
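// High CPU: 5-minute average of '% Processor Time' per computer, excluding VMs currently in maintenance.
let get_rg = (s:string)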
{
split((s), "/", 4)
};
let get_sub = (s:string)
{
split((s), "/", 2)
};
Perf
| extend rg = get_rg(_ResourceId)[0]
| extend sub = get_sub(_ResourceId)[0]
| where ObjectName == 'Processor' and CounterName == '% Processor Time'
| where sub == '<subscriptionid>' and rg == '<resourcegroup>'
| where TimeGenerated >= now(-10m)
| summarize AggregatedValue = avg(CounterValue) by tostring(sub), tostring(rg), bin(TimeGenerated, 5m), Computer
| join kind= leftouter(
MaintenanceVM_CL
| where now() >= StartTime_t and now() <= EndTime_t
) on Computer
| where MaintenanceType_s == ""
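// Low memory: 5-minute average of 'Available MBytes' per computer, excluding VMs currently in maintenance.
let get_rg = (s:string)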
{
split((s), "/", 4)
};
let get_sub = (s:string)
{
split((s), "/", 2)
};
Perf
| extend rg = get_rg(_ResourceId)[0]
| extend sub = get_sub(_ResourceId)[0]
| where ObjectName == 'Memory' and CounterName == 'Available MBytes'
| where sub == '<subscriptionid>' and rg == '<resourcegroup>'
| where TimeGenerated >= now(-10m)
| summarize AggregatedValue = avg(CounterValue) by tostring(sub), tostring(rg), bin(TimeGenerated, 5m), Computer
| join kind= leftouter(
MaintenanceVM_CL
| where now() >= StartTime_t and now() <= EndTime_t
) on Computer
| where MaintenanceType_s == ""
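// Low disk space: 5-minute average of '% Free Space' per computer and drive, excluding VMs currently in maintenance.
let get_rg = (s:string)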
{
split((s), "/", 4)
};
let get_sub = (s:string)
{
split((s), "/", 2)
};
Perf
| extend rg = get_rg(_ResourceId)[0]
| extend sub = get_sub(_ResourceId)[0]
| extend Drive = strcat(Computer, ' - ', InstanceName)
| where ObjectName == "LogicalDisk" or ObjectName == "Logical Disk"
| where CounterName == "% Free Space"
| where InstanceName <> "_Total"
| where sub == '<subscriptionid>' and rg == '<resourcegroup>'
| where TimeGenerated >= now(-10m)
| summarize AggregatedValue = avg(CounterValue) by Computer, Drive, bin(TimeGenerated, 5m)
| join kind= leftouter(
MaintenanceVM_CL
| where now() >= StartTime_t and now() <= EndTime_t
) on Computer
| where MaintenanceType_s == ""
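// VM down or agent not reporting: computers whose latest heartbeat is older than 5 minutes, excluding VMs currently in maintenance.
let utc_to_us_date_format = (t:datetime)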
{
strcat(getmonth(t), "/", dayofmonth(t),"/", getyear(t), " ",
bin((t-1h)%12h+1h,1s), iff(t%24h<12h, " AM UTC", " PM UTC"))
};
Heartbeat
| where TimeGenerated < now()
| where SubscriptionId == '<subscriptionid>' and ResourceGroup == '<resourcegroup>'
| summarize TimeGenerated=max(TimeGenerated) by Computer
| project TimeGenerated, Computer
| extend localtimestamp = utc_to_us_date_format(TimeGenerated)
| extend LastHeartbeat = localtimestamp
| summarize AggregatedValue = count() by Computer, LastHeartbeat, TimeGenerated
| where TimeGenerated < ago(5m)
| join kind= leftouter(
MaintenanceVM_CL
| where now() >= StartTime_t and now() <= EndTime_t
) on Computer
| where MaintenanceType_s == ""
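
All four queries end with the same exclusion step: a left outer join against the maintenance records that are active right now, followed by a filter that keeps only the rows with no match. The Python sketch below shows the same logic on plain dictionaries. The function name is hypothetical, and the keys mirror the query columns without the type suffixes.

```python
from datetime import datetime, timezone


def exclude_in_maintenance(results, maintenance, now):
    """Drop result rows whose Computer has a maintenance record covering `now`."""
    active = {
        record["Computer"]
        for record in maintenance
        if record["StartTime"] <= now <= record["EndTime"]
    }
    # Equivalent to keeping the left-outer-join rows where the right side is empty.
    return [row for row in results if row["Computer"] not in active]


now = datetime(2024, 6, 19, 2, 0, tzinfo=timezone.utc)
results = [
    {"Computer": "vm01", "AggregatedValue": 97.0},
    {"Computer": "vm02", "AggregatedValue": 95.5},
]
maintenance = [{
    "Computer": "vm01",
    "StartTime": datetime(2024, 6, 19, 1, 0, tzinfo=timezone.utc),
    "EndTime": datetime(2024, 6, 19, 4, 0, tzinfo=timezone.utc),
}]
alerting = exclude_in_maintenance(results, maintenance, now)
```

As in the queries, the match is on the exact computer name, which is why unique VM names matter for this solution.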
Filtering by Subscription and Resource Group allows you to:
- Share the same Azure Monitor Logs Workspace between subscriptions and resource groups. It is a good practice to keep the number of workspaces as low as possible.
- Configure different alert signal thresholds for different resource groups that can map to environment and application.
- Configure different action groups for different Resource Groups.
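
The `get_sub` and `get_rg` helpers in the performance queries obtain these two values by splitting `_ResourceId` on "/" and taking the segments at index 2 and index 4. A minimal Python sketch of the same extraction, with a hypothetical function name and a made-up resource ID, looks like this:

```python
def subscription_and_resource_group(resource_id: str):
    """Return (subscription id, resource group) from an Azure resource ID."""
    parts = resource_id.split("/")
    # "/subscriptions/<sub>/resourcegroups/<rg>/..." splits into
    # ["", "subscriptions", <sub>, "resourcegroups", <rg>, ...]
    if (
        len(parts) < 5
        or parts[1].lower() != "subscriptions"
        or parts[3].lower() != "resourcegroups"
    ):
        raise ValueError(f"not a resource-group-scoped resource ID: {resource_id!r}")
    return parts[2], parts[4]


# Made-up resource ID for illustration only.
sub, rg = subscription_and_resource_group(
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourcegroups/rg-app/providers/microsoft.compute/virtualmachines/vm01"
)
```

Unlike the Kusto helpers, this sketch also checks the literal segment names, so a malformed ID fails loudly instead of yielding the wrong segment.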
Acknowledgements
This repo and the solution presented here would not be possible without the awesome guidance from Rob Kuehfus and Shannon Kuehn.
References: https://docs.microsoft.com/en-us/azure/azure-monitor/log-query/logs-structure