
Primary language: PowerShell. License: MIT.

Custom Monitoring for Azure VMs with Azure Monitor Logs

Azure Monitor provides many out-of-the-box metrics that you can use to monitor your VMs. When the requirements get more specific, however, you often need custom Kusto queries.

Business Case

This repository provides a solution for the following scenario:
A company wants to use Azure Monitor and alerts for the following events:
  1. Alert when a VM has high CPU utilization: over 90% for 5 minutes. Filtered by Subscription and Resource Group.
  2. Alert when a VM has low memory: less than 200 MB available for 5 minutes. Filtered by Subscription and Resource Group.
  3. Alert when a VM has low disk space: less than 10% available. Filtered by Subscription and Resource Group.
  4. Alert when a VM is down, the agent is not reporting, or the VM is generally unhealthy for a period of 5 minutes. Filtered by Subscription and Resource Group.
The alerts must exclude VMs that are in a maintenance window. Each VM may have a different maintenance window.
The maintenance window is specified in the VM tag 'maintenance' with the format "zz-xddd-hhmm-HHMM-Z", where:
zz     Not used for alerting purposes
x      n-th weekday of the month on which maintenance is performed every month. For the 3rd Wednesday, x=3
ddd    First 3 letters of the weekday on which maintenance is performed
hh     Hour of the day when maintenance starts
mm     Minute of the hour when maintenance starts
HH     Hour of the day when maintenance ends
MM     Minute of the hour when maintenance ends
Z      Not used for alerting purposes
This way, a VM with scheduled maintenance every 2nd Tuesday of the month from 10:00 to 11:30 (hours on a 24-hour clock) would have the 'maintenance' tag value "zz-2tue-1000-1130-w".
ALL TIMES ARE IN UTC.
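To make the tag format concrete, here is a minimal Python sketch of how such a tag could be parsed. It is not the code in maintenance.psm1 (which is PowerShell); the names `parse_maintenance_tag` and `MaintenanceWindow` are hypothetical, and the sketch assumes hours are on a 24-hour clock and that the first and last segments can be ignored.

```python
import re
from dataclasses import dataclass
from datetime import time

# Hypothetical helper for illustration only; not taken from maintenance.psm1.
# Segments: <ignored>-<n><ddd>-<hhmm>-<HHMM>-<ignored>
TAG_PATTERN = re.compile(
    r"^[^-]*-(?P<nth>[1-5])(?P<day>[a-z]{3})-(?P<start>\d{4})-(?P<end>\d{4})-[^-]*$",
    re.IGNORECASE,
)
WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


@dataclass(frozen=True)
class MaintenanceWindow:
    nth: int      # n-th occurrence of the weekday in the month (1-5)
    weekday: int  # 0 = Monday ... 6 = Sunday (datetime.weekday() convention)
    start: time   # window start, UTC
    end: time     # window end, UTC


def parse_maintenance_tag(tag: str) -> MaintenanceWindow:
    """Parse a 'maintenance' tag value such as 'zz-2tue-1000-1130-w'."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognised maintenance tag: {tag!r}")
    day = match["day"].lower()
    if day not in WEEKDAYS:
        raise ValueError(f"unknown weekday in maintenance tag: {day!r}")

    def to_time(hhmm: str) -> time:
        # time() itself rejects out-of-range values such as hour 25.
        return time(int(hhmm[:2]), int(hhmm[2:]))

    return MaintenanceWindow(
        nth=int(match["nth"]),
        weekday=WEEKDAYS.index(day),
        start=to_time(match["start"]),
        end=to_time(match["end"]),
    )
```

For the example tag, `parse_maintenance_tag("zz-2tue-1000-1130-w")` yields the 2nd Tuesday with a window from 10:00 to 11:30 UTC. Rejecting malformed tags loudly, rather than skipping them, is a design choice of this sketch.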
In addition to scheduled maintenance, it is desirable to be able to exempt a VM from alerts on demand. This way any VM can be temporarily removed from the alerts at any time and for an indefinite period (not yet implemented in the solution).

Solution

  1. Azure Automation Runbook scheduled to run every hour to:
    • Query the VM tag "maintenance" and determine the list of VMs that will be in maintenance in the next hour. The runbook runs 10 minutes before each hour (0:50, 1:50, 2:50, ...) and looks 10 minutes ahead when determining which VMs are in maintenance. This way it picks up the right VMs for the right maintenance window and allows 10 minutes for Azure Automation job queuing, job completion, and Log Analytics record ingestion. The end result is that alerts stop just a few minutes before the actual maintenance time.
    • Send the list of VMs to a custom table in Log Analytics. The table is named MaintenanceVM_CL.
  2. Kusto queries for the alerts that join the MaintenanceVM_CL custom table on Computer, taking into account the new maintenance records from the last hour so that those VMs are excluded from the query results.
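The scheduling rule described above (run at minute 50, look 10 minutes ahead, cover the following hour) can be sketched in Python as follows. This is one plausible reading of the description, not the runbook's actual PowerShell: the function names are hypothetical, and the sketch only checks whether a window starts inside the upcoming one-hour slot.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import Optional


def nth_weekday_of_month(year: int, month: int, weekday: int, nth: int) -> Optional[date]:
    """Date of the n-th given weekday (0 = Monday) in a month, or None if it does not exist."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7   # days from the 1st to the first such weekday
    day = 1 + offset + 7 * (nth - 1)
    try:
        return date(year, month, day)
    except ValueError:                          # e.g. no 5th Tuesday this month
        return None


def starts_in_next_slot(
    run_time: datetime,
    nth: int,
    weekday: int,
    start: time,
    lead: timedelta = timedelta(minutes=10),
    slot: timedelta = timedelta(hours=1),
) -> bool:
    """True if the monthly window starts within [run_time + lead, run_time + lead + slot).

    run_time must be timezone-aware UTC; 'lead' and 'slot' mirror the 10-minute
    look-ahead and hourly schedule described in the text (assumed values).
    """
    slot_start = run_time + lead
    slot_end = slot_start + slot
    # The slot may touch two calendar days if the runbook does not run exactly at :50.
    for day in {slot_start.date(), slot_end.date()}:
        if nth_weekday_of_month(day.year, day.month, weekday, nth) == day:
            window_start = datetime.combine(day, start, tzinfo=timezone.utc)
            if slot_start <= window_start < slot_end:
                return True
    return False
```

With a window on the 2nd Tuesday at 10:00 UTC, a run at 09:50 UTC on that day selects the VM, while runs at 08:50 or 10:50 do not.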

Solution Components

maintenance.psm1        PowerShell module to be uploaded to your Azure Automation Account. This code is used by the Automation runbooks.
maintschedule.ps1       Automation runbook to be uploaded and scheduled every hour at the 50-minute offset. Workspace Id and Shared Key need to be updated with your Log Analytics info.
maintondemand.ps1       Automation runbook to be uploaded and triggered through a webhook with VM Name, StartDtm and EndDtm. Workspace Id and Shared Key need to be updated with your Log Analytics info.
Log Analytics Alerts    The code for the queries is provided below in the section Alert Implementation with Kusto Queries. The subscription id and resource group need to be updated with your values.
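The runbooks need a Workspace Id and Shared Key, which suggests they write to the custom table through the Log Analytics HTTP Data Collector API. Assuming that is the case, the Python sketch below shows how such a request is signed and what a record might look like. It builds the request without sending it. The function names and the record field names (Computer, MaintenanceType, StartTime, EndTime) are assumptions inferred from the columns used by the queries; the runbooks' real payload may differ.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional


def build_signature(workspace_id: str, shared_key: str, body: bytes, rfc1123_date: str) -> str:
    """Authorization header value for the HTTP Data Collector API (SharedKey scheme)."""
    string_to_sign = (
        f"POST\n{len(body)}\napplication/json\nx-ms-date:{rfc1123_date}\n/api/logs"
    )
    digest = hmac.new(
        base64.b64decode(shared_key),          # the workspace key is base64-encoded
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return f"SharedKey {workspace_id}:{base64.b64encode(digest).decode('ascii')}"


def build_request(workspace_id: str, shared_key: str, records: list,
                  now: Optional[datetime] = None):
    """Return (url, headers, body) for posting records to the MaintenanceVM_CL table."""
    now = now or datetime.now(timezone.utc)
    body = json.dumps(records).encode("utf-8")
    rfc1123_date = format_datetime(now, usegmt=True)
    headers = {
        "Content-Type": "application/json",
        "Authorization": build_signature(workspace_id, shared_key, body, rfc1123_date),
        "Log-Type": "MaintenanceVM",           # Log Analytics appends "_CL" to the table name
        "x-ms-date": rfc1123_date,
    }
    url = f"https://{workspace_id}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01"
    return url, headers, body


# Hypothetical record; ISO 8601 strings are stored as datetime columns (suffix _t).
example_record = {
    "Computer": "vm01",
    "MaintenanceType": "Scheduled",
    "StartTime": "2024-06-11T10:00:00Z",
    "EndTime": "2024-06-11T11:30:00Z",
}
```

The returned triple can be handed to any HTTP client as a POST request. Keeping the signing step separate from the sending step makes it easy to check the signature logic without a live workspace.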

Limitations

  • The solution works as long as every VM in the organization has a different name. Good naming standards are a best practice. Any collision between VM names would adversely affect this implementation. A potential solution to this issue is to use the resourceId instead of the VM name.
  • The 'maintenance' tag requires the use of UTC. This simplifies the code and matches Log Analytics, which uses UTC times. Azure runs in UTC, so your Azure operations should too.

Pre-requisites

  • A Log Analytics Workspace.
  • All VMs that will be monitored must be enrolled in the Workspace.

Alert Implementation with Kusto Queries

Go to Azure Monitor > Alerts
For each Alert:
  1. Click on New Alert Rule
  2. Click on Resource "Select" Button.
  3. Filter by resource type "Log Analytics Workspaces" and select the Workspace.
  4. Click on the Condition "Add" button and select "Custom log search" as the Signal Name.
  5. Configure the Kusto query as described below for each case.
  6. Configure Actions
  7. Click on Create Alert Rule

CPU High

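// Return the resource group segment (index 4) of a resource ID as a one-element array.
let get_rg = (s:string)
{
    split(s, "/", 4)
};
// Return the subscription ID segment (index 2) of a resource ID as a one-element array.
let get_sub = (s:string)
{
    split(s, "/", 2)
};
Perf
| extend rg = get_rg(_ResourceId)[0]
| extend sub = get_sub(_ResourceId)[0]
| where ObjectName == 'Processor' and CounterName == '% Processor Time'
| where sub == '<subscriptionid>' and rg == '<resourcegroup>'
| where TimeGenerated >= now(-10m)
// Average CPU per computer in 5-minute bins.
| summarize AggregatedValue = avg(CounterValue) by tostring(sub), tostring(rg), bin(TimeGenerated, 5m), Computer
// Attach any maintenance record that is active right now.
| join kind=leftouter (
    MaintenanceVM_CL
    | where now() >= StartTime_t and now() <= EndTime_t
) on Computer
// Keep only computers without an active maintenance record.
| where MaintenanceType_s == ""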
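
The indices 2 and 4 used by get_sub and get_rg follow from the shape of an Azure resource ID: because the ID starts with "/", splitting on "/" yields an empty first element. The short Python sketch below mirrors that indexing on a made-up resource ID; `resource_id_segment` is a hypothetical helper, not part of the repository.

```python
def resource_id_segment(resource_id: str, index: int) -> str:
    """Return the '/'-separated segment at the given index, or '' if it is missing."""
    parts = resource_id.split("/")
    return parts[index] if index < len(parts) else ""


# Made-up resource ID for illustration.
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourcegroups/rg-demo"
    "/providers/microsoft.compute/virtualmachines/vm01"
)

# index: 0 -> "", 1 -> "subscriptions", 2 -> subscription ID,
#        3 -> "resourcegroups", 4 -> resource group name
subscription_id = resource_id_segment(example_id, 2)
resource_group = resource_id_segment(example_id, 4)
```

Since `==` is case-sensitive in Kusto, the casing of the values you substitute for '<subscriptionid>' and '<resourcegroup>' is worth checking against the _ResourceId values in your workspace if the filter unexpectedly returns nothing.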

Low Memory

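// Return the resource group segment (index 4) of a resource ID as a one-element array.
let get_rg = (s:string)
{
    split(s, "/", 4)
};
// Return the subscription ID segment (index 2) of a resource ID as a one-element array.
let get_sub = (s:string)
{
    split(s, "/", 2)
};
Perf
| extend rg = get_rg(_ResourceId)[0]
| extend sub = get_sub(_ResourceId)[0]
| where ObjectName == 'Memory' and CounterName == 'Available MBytes'
| where sub == '<subscriptionid>' and rg == '<resourcegroup>'
| where TimeGenerated >= now(-10m)
// Average available memory (MB) per computer in 5-minute bins.
| summarize AggregatedValue = avg(CounterValue) by tostring(sub), tostring(rg), bin(TimeGenerated, 5m), Computer
// Attach any maintenance record that is active right now.
| join kind=leftouter (
    MaintenanceVM_CL
    | where now() >= StartTime_t and now() <= EndTime_t
) on Computer
// Keep only computers without an active maintenance record.
| where MaintenanceType_s == ""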

Low Disk Space

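// Return the resource group segment (index 4) of a resource ID as a one-element array.
let get_rg = (s:string)
{
    split(s, "/", 4)
};
// Return the subscription ID segment (index 2) of a resource ID as a one-element array.
let get_sub = (s:string)
{
    split(s, "/", 2)
};
Perf
| extend rg = get_rg(_ResourceId)[0]
| extend sub = get_sub(_ResourceId)[0]
// Label each disk as "<computer> - <instance>" so results are reported per drive.
| extend Drive = strcat(Computer, ' - ', InstanceName)
| where ObjectName == "LogicalDisk" or ObjectName == "Logical Disk"
| where CounterName == "% Free Space"
| where InstanceName <> "_Total"
| where sub == '<subscriptionid>' and rg == '<resourcegroup>'
| where TimeGenerated >= now(-10m)
// Average free space (%) per drive in 5-minute bins.
| summarize AggregatedValue = avg(CounterValue) by Computer, Drive, bin(TimeGenerated, 5m)
// Attach any maintenance record that is active right now.
| join kind=leftouter (
    MaintenanceVM_CL
    | where now() >= StartTime_t and now() <= EndTime_t
) on Computer
// Keep only computers without an active maintenance record.
| where MaintenanceType_s == ""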

VM Down

// Format a UTC datetime as a US-style date (M/D/YYYY) with a 12-hour time and AM/PM suffix.
let utc_to_us_date_format = (t:datetime)
{
    strcat(getmonth(t), "/", dayofmonth(t), "/", getyear(t), " ",
    bin((t-1h)%12h+1h, 1s), iff(t%24h < 12h, " AM UTC", " PM UTC"))
};
Heartbeat
| where TimeGenerated < now()
| where SubscriptionId == '<subscriptionid>' and ResourceGroup == '<resourcegroup>'
// Latest heartbeat per computer.
| summarize TimeGenerated=max(TimeGenerated) by Computer
| project TimeGenerated, Computer
| extend localtimestamp = utc_to_us_date_format(TimeGenerated)
| extend LastHeartbeat = localtimestamp
| summarize AggregatedValue = count() by Computer, LastHeartbeat, TimeGenerated
// Keep computers whose latest heartbeat is more than 5 minutes old.
| where TimeGenerated < ago(5m)
// Attach any maintenance record that is active right now.
| join kind=leftouter (
    MaintenanceVM_CL
    | where now() >= StartTime_t and now() <= EndTime_t
) on Computer
// Keep only computers without an active maintenance record.
| where MaintenanceType_s == ""
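
All four queries end with the same exclusion pattern: a left outer join against the maintenance records that are active now, followed by a filter that keeps only rows with an empty MaintenanceType_s. The Python sketch below restates that logic over plain dictionaries to show the intent. The function and field names are hypothetical, and the equivalence with the queries holds only if every maintenance record carries a non-empty maintenance type.

```python
from datetime import datetime, timezone


def exclude_in_maintenance(alert_rows: list, maintenance_rows: list, now: datetime) -> list:
    """Drop alert rows whose Computer has a maintenance record active at 'now'."""
    # Equivalent of: MaintenanceVM_CL | where now() >= StartTime_t and now() <= EndTime_t
    active = {
        m["Computer"]
        for m in maintenance_rows
        if m["StartTime"] <= now <= m["EndTime"]
    }
    # Equivalent of: join kind=leftouter ... on Computer | where MaintenanceType_s == ""
    # The match on Computer is exact (case-sensitive), as in the Kusto join.
    return [row for row in alert_rows if row["Computer"] not in active]


# Illustrative data only.
now = datetime(2024, 6, 11, 10, 15, tzinfo=timezone.utc)
alerts = [
    {"Computer": "vm01", "AggregatedValue": 97.0},
    {"Computer": "vm02", "AggregatedValue": 95.5},
]
maintenance = [
    {
        "Computer": "vm01",
        "StartTime": datetime(2024, 6, 11, 10, 0, tzinfo=timezone.utc),
        "EndTime": datetime(2024, 6, 11, 11, 30, tzinfo=timezone.utc),
    }
]
remaining = exclude_in_maintenance(alerts, maintenance, now)
```

Here only vm02 remains, because vm01 has a maintenance record covering the current time.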


Filtering by Subscription and Resource Group allows you to:

  • Share the same Azure Monitor Logs Workspace between subscriptions and resource groups. It is a good practice to keep the number of workspaces as low as possible.
  • Configure different alert signal thresholds for different resource groups, which can map to environments and applications.
  • Configure different action groups for different Resource Groups.
Ultimately, if the Subscription or Resource Group filter is not useful to you, it can be removed from the Kusto query.

Acknowledgements

This repo and the solution presented here would not be possible without the awesome guidance from Rob Kuehfus and Shannon Kuehn.

References: https://docs.microsoft.com/en-us/azure/azure-monitor/log-query/logs-structure