Azure Monitoring examples

🚧 This content is work in progress.

Designing a complete monitoring solution requires that you understand the different scenarios and requirements so that you can prepare your Azure components to match them. Here are a few thoughts on how to approach that planning.

Planning

A typical monitoring solution is a combination of the scenarios listed below. Therefore, it makes sense to look at them from a scenario point of view.

You should know your requirements because ultimately they impact your monitoring solution.

Example: You're required to store certain application events for 5 years -> You have to think about long-term storage, such as an Azure Storage account, for those events. The maximum interactive data retention of a Log Analytics workspace is 2 years (730 days).

Example: You need to provide a chart of a certain Azure resource metric for the last 8 months -> You have to store this metric data in logs, since metric data is only available for 3 months (93 days).
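
The two examples above boil down to comparing a retention requirement against the limits of each store. A minimal sketch of that decision, assuming only the limits quoted above (93 days for metrics, 730 days for a Log Analytics workspace); the function name and the returned labels are illustrative, not part of any Azure API:

```python
# Retention limits quoted in the text above, in days.
METRIC_RETENTION_DAYS = 93
LOG_ANALYTICS_RETENTION_DAYS = 730


def pick_store(data_kind: str, required_days: int) -> str:
    """Return a storage option that satisfies the retention requirement.

    data_kind is "metric" or "log"; the returned strings are plain labels.
    """
    if data_kind == "metric" and required_days <= METRIC_RETENTION_DAYS:
        return "Azure Monitor metrics"
    if required_days <= LOG_ANALYTICS_RETENTION_DAYS:
        return "Log Analytics workspace"
    return "Azure Storage account"


print(pick_store("metric", 8 * 30))  # 8 months of metric data
print(pick_store("log", 5 * 365))    # 5 years of application events
```

Real planning also has to weigh cost and query needs, so treat this only as a way to make the requirement explicit.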

If you have a hard time planning your overall solution in terms of technical components, you can try event modeling to help.

Implementation steps

When planning and implementing your monitoring scenarios, you typically follow these steps:

Enable data collection
Find correct data
Create alert from data
Create action from alert
Visualize
Test
Automate

1. Enable data collection

What you can't see, you can't measure. What you can't measure, you can't improve.

Quote from Enterprise-scale architecture operational design principles / Management and monitoring

Based on your monitoring scenario, you might need to enable data collection on a virtual machine (e.g. Windows Performance Counters: Process(*)\% Processor Time for monitoring processor usage per process) or at different Azure resource levels (e.g. push resource metrics to a Log Analytics workspace).

2. Find correct data

Then you need to verify that you can actually find the correct data. In some scenarios that can be as simple as viewing metrics charts; in more advanced scenarios you need to find your data using KQL queries.

Example: Find CPU usage for the process CalcService (an important background Windows service):

Perf
| where ObjectName == "Process" and
        CounterName == "% Processor Time" and
        Computer == "vmname" and InstanceName == "CalcService"
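
If you need the same query for several machines or processes, you can generate it instead of copying it. The following is a sketch only: the helper names are made up for this example, and the added `summarize` step (averaging `CounterValue` per time bin) is an assumption about what you might want to chart or alert on, not part of the query above.

```python
def kql_string(value: str) -> str:
    """Quote a Python string as a KQL string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_process_cpu_query(computer: str, process: str, bin_minutes: int = 5) -> str:
    """Return a KQL query that averages '% Processor Time' for one process."""
    return "\n".join([
        "Perf",
        '| where ObjectName == "Process" and CounterName == "% Processor Time"',
        f"| where Computer == {kql_string(computer)} and InstanceName == {kql_string(process)}",
        f"| summarize AvgCpu = avg(CounterValue) by bin(TimeGenerated, {bin_minutes}m)",
    ])


print(build_process_cpu_query("vmname", "CalcService"))
```

The printed text can be pasted into the Logs view of the workspace to check that it returns the expected rows.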

3. Create alert from data

When you have found the data that you want to use for monitoring, you can follow these instructions for implementing your alerts: Overview of alerts in Microsoft Azure

Note: You can create an alert rule that fires when the query finds data and, similarly, one that fires when it doesn't find data.

Example: Look for a running process and, if it is not found, trigger an alert.
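
The "alert when no data is found" idea can be sketched outside Azure as follows. This only illustrates the logic; in Azure Monitor you express it in the alert rule condition on the query result. The function name and the 10-minute window are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def should_alert_on_missing_data(sample_times, now, window):
    """Return True when no sample falls inside [now - window, now]."""
    cutoff = now - window
    return not any(cutoff <= t <= now for t in sample_times)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = timedelta(minutes=10)

# The process reported 3 minutes ago: no alert.
recent = [now - timedelta(minutes=3)]
print(should_alert_on_missing_data(recent, now, window))

# The last report is 45 minutes old: alert.
stale = [now - timedelta(minutes=45)]
print(should_alert_on_missing_data(stale, now, window))
```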

4. Create action from alert

Alerts cause actions to trigger, and for that we use action groups.

You should plan your action groups so that you reach the right people, the ones who can actually do something about a given alert.

Example: Your app relies on a downstream API developed by another team inside your company. If that API starts to fail and your application is impacted, you can create an action group that notifies that API team directly.
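
One way to make this planning explicit is to write down which component maps to which action group before you create anything in Azure. A minimal sketch; all component and action group names are hypothetical:

```python
# Hypothetical mapping from failing component to the action group to notify.
ROUTES = {
    "downstream-api": "ag-api-team",
    "calc-service": "ag-backend-team",
}
DEFAULT_ACTION_GROUP = "ag-operations"


def action_group_for(component: str) -> str:
    """Return the action group for a component, falling back to operations."""
    return ROUTES.get(component, DEFAULT_ACTION_GROUP)


print(action_group_for("downstream-api"))
print(action_group_for("unknown-component"))
```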

5. Visualize

Often alerts and notifications are enough to start the incident and troubleshooting process. Sometimes it greatly helps to have additional dashboards, workbooks or other visualizations that clarify the underlying conditions.

You can look for examples in the microsoft/AzureMonitorCommunity repository.

6. Test

To make sure that the query and the alert work as intended, you of course have to test your implementation. In the example above, it would mean that you stop the CalcService Windows service, which should cause the alert to fire.

7. Automate

To deploy these reliably across environments, you have to automate the deployment of the different components.

Here are a few links for getting started with the automation:

Scenarios

Scenario 1

What

Collect data from Azure resources with minimal effort and get alerted under specific conditions

How

Create a Log Analytics workspace for logs
Set Diagnostic settings in Azure resources to send data to the Log Analytics workspace
Create a log-based query alert on the workspace

A log query can then be used for creating alerts:

In the above example, a webhook is called when the alert is fired.

Read more about all available actions in action groups.

Here are a few example queries:

Find failed Logic Apps integrations:

AzureDiagnostics 
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where Level == "Error"

Find specific custom exception:

AppExceptions
| where ExceptionType == "ContosoRetailBackendException"

Notes

Can be managed at scale using Azure Policies
- Enterprise-Scale and Azure Policy for policy-driven governance
- Deploy Enterprise-Scale Azure policies
Some resources support a resource-specific schema
Application Insights can use the workspace for data storage (you don't need a diagnostic setting in that case)
Each query-based alert with a 5-minute interval costs $1.50 per month
- Prefer general query alerts ("Find Logic Apps Errors") over very specific queries that get multiplied per customer, per product, per xyz (resulting in n queries)
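
The cost difference between one general alert and many specific ones is easy to estimate. A sketch using the $1.50 per month figure from the note above; the customer and product counts are made-up numbers:

```python
from math import prod

PRICE_PER_QUERY_ALERT = 1.50  # USD per month, 5-minute interval (from the note above)


def rule_count(*dimension_sizes: int) -> int:
    """Number of alert rules when one rule is created per combination."""
    return prod(dimension_sizes)


def monthly_cost(rules: int, price: float = PRICE_PER_QUERY_ALERT) -> float:
    """Monthly cost in USD for the given number of query alert rules."""
    return rules * price


# One general rule, e.g. "Find Logic Apps Errors".
print(monthly_cost(1))

# One rule per customer (10) per product (4): 40 rules.
print(monthly_cost(rule_count(10, 4)))
```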

If you have a single application that already uses Application Insights, you can create a similar query-based alert there:

Scenario 2

What

Create metric-based alerts for Azure resources

How

Find the Azure resource metric that you want to monitor
Create a metric-based alert on that resource

Here are a few examples:

Failed runs in Logic Apps resource:

Exception count in Application Insights resource:

DTU (Database transaction unit) usage is high in SQL Database:

Notes

Alerts have state, and the platform automatically changes the state from Fired to Resolved when the condition clears
- You get notified when the state changes to Resolved
Limited filtering is available for metrics (dimensions of metrics)
- Example: You cannot create an alert only for specific exceptions in App Insights using metric alerts
A metric-based alert costs $0.10 per monitored signal per month
Use the common alert schema
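
With the common alert schema, a webhook receiver can read the same fields regardless of the alert type, including the Fired and Resolved state. Below is a sketch of such a receiver's parsing step. The sample payload is heavily trimmed and its values are invented; only the field names (`schemaId`, `data.essentials.alertRule`, `severity`, `monitorCondition`) follow the common alert schema. The function name is hypothetical.

```python
# Trimmed, invented sample in the shape of a common alert schema payload.
SAMPLE_PAYLOAD = {
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertRule": "logic-app-failed-runs",
            "severity": "Sev2",
            "signalType": "Metric",
            "monitorCondition": "Fired",
        },
        "alertContext": {},
    },
}


def summarize_alert(payload: dict) -> str:
    """Return a one-line summary such as '[Fired] Sev2 <rule name>'."""
    if payload.get("schemaId") != "azureMonitorCommonAlertSchema":
        raise ValueError("not a common alert schema payload")
    essentials = payload["data"]["essentials"]
    return f'[{essentials["monitorCondition"]}] {essentials["severity"]} {essentials["alertRule"]}'


print(summarize_alert(SAMPLE_PAYLOAD))
```

When the condition clears, the same rule sends a payload with `monitorCondition` set to `Resolved`, so the receiver can close whatever it opened on `Fired`.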

Scenario 3

What

Enable custom processing based on Azure resource metric or log data

How

Create Event Hub and Azure Functions resources
Azure Function listens for incoming data from the Event Hub
Deploy custom processing logic to Azure Functions
Set Diagnostic settings in Azure resources to send data to your Event Hub

Here is an example:

Notes

Requires custom development
- Simplified example of an Event Hub forwarder: src/EventHubListener/EventHubForwarderFunction.cs
Full flexibility and control
Diagnostic settings can be managed at scale using Azure Policies
You can use Scenario 1 for a large-scale monitoring solution and extend it with this more custom solution for selected events only, to optimize certain automation scenarios
- You can have up to 5 diagnostic settings applied to an Azure resource
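
The custom processing logic usually starts by unpacking the event body. Diagnostic settings deliver data to Event Hub as JSON with a top-level `records` array, so a first step can look like the sketch below. This is not the repository's EventHubForwarderFunction; the function name, the sample records and the category filter are assumptions for illustration, and the Event Hub trigger binding itself is left out.

```python
import json


def select_records(body: bytes, categories: set) -> list:
    """Return the records from one Event Hub message whose category is wanted."""
    envelope = json.loads(body)
    return [r for r in envelope.get("records", []) if r.get("category") in categories]


# Invented sample message with two records.
sample_body = json.dumps({
    "records": [
        {"time": "2024-01-01T12:00:00Z", "category": "WorkflowRuntime", "level": "Error"},
        {"time": "2024-01-01T12:00:05Z", "category": "AllMetrics"},
    ]
}).encode("utf-8")

for record in select_records(sample_body, {"WorkflowRuntime"}):
    print(record["time"], record["level"])
```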

Scenario 4

What

Minimize latency from event to action

How

Create Event Hub and Azure Functions resources
Azure Function listens for incoming data from the Event Hub
Deploy custom processing logic to Azure Functions
Use the custom endpoint directly from your applications

Here is an example:

Notes

Heavy on custom development
Very low latency
Makes sense if the action is automated
- E.g. call an API when a certain event occurs or a metric threshold is met
- Hard to justify if the action requires humans to take corrective actions
You need to create reusable code to do this in multiple applications
- E.g. a NuGet package for your .NET apps
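
The reusable code mentioned above can be as small as one function that packages an event for your custom endpoint. A sketch using only the Python standard library; the endpoint URL, the event shape (`name`, `properties`) and the function name are all hypothetical, and the request is built but not sent:

```python
import json
from urllib.request import Request


def build_event_request(endpoint: str, event_name: str, properties: dict) -> Request:
    """Build a JSON POST request that carries one application event."""
    body = json.dumps({"name": event_name, "properties": properties}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint; replace with the URL of your own ingestion endpoint.
request = build_event_request(
    "https://example.invalid/api/events",
    "OrderFailed",
    {"orderId": "12345"},
)
print(request.get_method(), request.full_url)
```

To actually send it, pass the request to `urllib.request.urlopen` with a short timeout, and decide up front what the application should do if the endpoint is unavailable.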

Additional notes

Limits

You can have up to 5 diagnostic settings applied to an Azure resource.

Azure Monitor service limits

Data sink conflict

If you're configuring diagnostic settings for your resource, you might get the following error:

Failed to update diagnostics for 'monitoringdemo'.
{
  "code":"Conflict",
  "message": "Data sink '/subscriptions/<id>/resourceGroups/<rg>/providers/Microsoft.EventHub/namespaces/<ns>/authorizationrules/RootManageSharedAccessKey'
  is already used in diagnostic setting 'monitoring' for category 'AppExceptions'.
  Data sinks can't be reused in different settings on the same category for the same resource."
}.

It means that you cannot create multiple diagnostic settings with the same category targeting the same destination. In the Event Hub scenario, the destination includes the authorizationrules/<your access key> part.

The following is not allowed:

AppEvents and AppExceptions to Event Hub namespace ns and event hub eh1 using RootManageSharedAccessKey
AppDependencies and AppExceptions to Event Hub namespace ns and event hub eh2 using RootManageSharedAccessKey

The following is allowed:

AppEvents and AppExceptions to Event Hub namespace ns and event hub eh1 using eh1Policy
AppDependencies and AppExceptions to Event Hub namespace ns and event hub eh2 using eh2Policy
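
You can check a planned set of diagnostic settings for this conflict before deploying it. The sketch below assumes, based on the error message above, that a conflict means the same category is sent through the same authorization rule in more than one setting. The setting names `s1` and `s2` and the function name are made up for the example.

```python
from collections import defaultdict


def find_sink_conflicts(settings):
    """Return {(authorization_rule, category): [setting names]} for reused sinks.

    settings is a list of (setting_name, authorization_rule, categories) tuples.
    """
    seen = defaultdict(list)
    for name, rule, categories in settings:
        for category in categories:
            seen[(rule, category)].append(name)
    return {key: names for key, names in seen.items() if len(names) > 1}


not_allowed = [
    ("s1", "RootManageSharedAccessKey", ["AppEvents", "AppExceptions"]),
    ("s2", "RootManageSharedAccessKey", ["AppDependencies", "AppExceptions"]),
]
allowed = [
    ("s1", "eh1Policy", ["AppEvents", "AppExceptions"]),
    ("s2", "eh2Policy", ["AppDependencies", "AppExceptions"]),
]

print(find_sink_conflicts(not_allowed))
print(find_sink_conflicts(allowed))
```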

Correlation

Read more about correlation in monitoring.

JanneMattila/azure-monitoring-examples
