🚧 This content is work in progress.
Designing a complete monitoring solution requires that you understand the different scenarios and requirements, so that you can prepare your Azure components to match them. Here are a few thoughts on how to approach that planning.
A typical monitoring solution is a combination of the scenarios listed below. Therefore, it makes sense to look at them from a scenario point of view.
You should know your requirements because ultimately they impact your monitoring solution.
Example: You're required to store certain application events for 5 years -> You have to plan for long-term storage, such as an Azure Storage account, for those events. The maximum data retention of a Log Analytics workspace is 2 years (730 days).
Example: You need to provide a chart of a certain Azure resource metric for the last 8 months -> You have to store this metric data in logs, since metric data is only available for 3 months (93 days).
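The two examples above boil down to comparing a required retention period against the limits of each store. Here is a minimal Python sketch of that reasoning. The function name and the returned labels are hypothetical, and the limits are the ones quoted above, so check them against current Azure documentation before relying on them.

```python
# Retention limits as quoted in the text above (assumed still valid).
METRICS_RETENTION_DAYS = 93            # platform metrics
LOG_ANALYTICS_RETENTION_DAYS = 730     # Log Analytics workspace


def choose_store(data_kind: str, required_days: int) -> str:
    """Return a hypothetical label for the first store whose retention covers the requirement."""
    if required_days <= 0:
        raise ValueError("required_days must be positive")
    if data_kind == "metric" and required_days <= METRICS_RETENTION_DAYS:
        return "azure-monitor-metrics"
    if required_days <= LOG_ANALYTICS_RETENTION_DAYS:
        return "log-analytics-workspace"
    return "storage-account"


# Application events for 5 years: beyond the workspace limit.
print(choose_store("log", 5 * 365))
# Metric chart for roughly 8 months: beyond the metrics limit, within the workspace limit.
print(choose_store("metric", 8 * 30))
```

The point of the sketch is only that the retention requirement should be known before you pick the components, not that these three stores are the only options.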
If you have a hard time planning your overall solution in terms of technical components, you can try event modeling for help.
When planning and implementing your monitoring scenarios you typically follow these steps:
- Enable data collection
- Find correct data
- Create alert from data
- Create action from alert
- Visualize
- Test
- Automate
What you can't see, you can't measure. What you can't measure, you can't improve.
Quote from Enterprise-scale architecture operational design principles / Management and monitoring
Based on your monitoring scenario, you might need to enable data collection in a virtual machine (e.g. the Windows Performance Counter Process(*)\% Processor Time for monitoring processor usage per process) or at different Azure resource levels (e.g. push resource metrics to a Log Analytics workspace).
Then you need to verify that you can actually find the correct data. In some scenarios that can be as simple as viewing metrics charts; in more advanced scenarios you need to find your data using KQL queries.
Example: Find CPU usage for the process CalcService (an important background Windows Service):
Perf
| where ObjectName == "Process" and
CounterName == "% Processor Time" and
Computer == "vmname" and InstanceName == "CalcService"
When you have found the data that you want to use for monitoring, you can follow these instructions for implementing your alerts: Overview of alerts in Microsoft Azure
Note: You can create an alert rule that fires when data is found and, similarly, one that fires when data is not found.
Example: Look for a running process and, if it is not found, trigger an alert.
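The "alert when data is missing" idea can be illustrated outside Azure with a small Python sketch. It is not how Azure Monitor evaluates rules internally. The function name and the 5-minute window are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def should_fire_missing_data_alert(sample_times, now, window=timedelta(minutes=5)):
    """Return True when no sample falls inside the window [now - window, now]."""
    window_start = now - window
    return not any(window_start <= t <= now for t in sample_times)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Process reported two minutes ago: data found, no alert.
print(should_fire_missing_data_alert([now - timedelta(minutes=2)], now))
# Last report was ten minutes ago: no data in the window, alert.
print(should_fire_missing_data_alert([now - timedelta(minutes=10)], now))
```

In a real rule the same condition would typically be expressed as a log query whose result count is compared against zero.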
Alerts cause actions to trigger, and for that we use action groups.
You should plan your action groups so that you reach the people who can actually do something about a given alert.
Example: Your app relies on a downstream API developed by another team inside your company. If that API starts to fail and your application is impacted, you can create an action group that notifies that API team directly.
Many times alerts and notifications are enough to start the incident and troubleshooting process. Sometimes it greatly helps to have additional dashboards, workbooks or other visualizations that clarify the underlying conditions.
You can look for examples in the microsoft/AzureMonitorCommunity repository.
To make sure that your query and alert work correctly, you of course have to test your implementation. In the example above, that would mean stopping the CalcService Windows Service, which should cause the alert to fire.
To deploy these reliably across environments, you have to automate the deployment of the different components.
Here are a few links for getting started with the automation:
- Resource Manager template samples for Azure Monitor
- Create a metric alert with a Resource Manager template
- Bicep example
Collect data from Azure resources with minimal effort and get alerted in specific conditions
- Create Log Analytics workspace for logs
- Set Diagnostic settings in Azure resources to send data to Log Analytics workspace
- Create log based query alert to workspace
The log query can then be used for creating alerts:
In the example above, a webhook is called when the alert fires.
Read more about all available actions in action groups.
Here are a few example queries:
Find failed Logic Apps integrations:
AzureDiagnostics
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where Level == "Error"
Find specific custom exception:
AppExceptions
| where ExceptionType == "ContosoRetailBackendException"
- Can be managed at scale using Azure Policies
- Some resources support a resource-specific schema
- Application Insights can use the workspace for data storage (you don't need a diagnostic setting in that case)
- Each query alert evaluated at a 5-minute interval costs $1.50 per month
- Try to create general query alerts ("Find Logic Apps Errors") rather than very specific queries that get multiplied by customer, by product, by xyz (causing n queries)
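To make the cost note concrete, here is a small Python sketch of the arithmetic. It uses the $1.50 per month figure quoted above, which is an assumption about pricing at the time of writing. The function name and the customer and product counts are made up for illustration.

```python
from decimal import Decimal

# Price per 5-minute query alert rule per month, as quoted in the text above.
QUERY_ALERT_MONTHLY_USD = Decimal("1.50")


def monthly_query_alert_cost(customers: int, products: int, general: bool) -> Decimal:
    """One general rule, or one specific rule per customer per product."""
    rule_count = 1 if general else customers * products
    return rule_count * QUERY_ALERT_MONTHLY_USD


# Hypothetical numbers: 20 customers, 5 products.
print(monthly_query_alert_cost(20, 5, general=True))
print(monthly_query_alert_cost(20, 5, general=False))
```

A general rule can still carry the customer or product in its query results, so the notification stays specific even though only one rule is billed.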
If you have a single application that already uses Application Insights, you can create a similar query-based alert there:
Create metric based alerts for Azure resources
- Find Azure resource metric that you want to monitor
- Create metric based alert to that resource
Here are a few examples:
Failed runs in Logic Apps resource:
Exception count in Application Insights resource:
DTU (Database transaction unit) usage is high in SQL Database:
- Alerts have state, and the platform automatically changes the state from Fired to Resolved when the condition clears
- You get notified when the state changes to Resolved
- Limited filtering available for metrics (dimensions of metrics)
- Example: You cannot create alert only for specific exceptions in App Insights using metric alerts
- Metric based alert costs $0.10 per monitored signal per month
- Use common alert schema
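The stateful behaviour noted above (Fired, then automatically Resolved, with a notification on each change) can be sketched as a tiny state machine. This Python model is a simplification for illustration only. The class name, the threshold and the sample values are hypothetical, and real metric alerts evaluate aggregated windows rather than single values.

```python
from dataclasses import dataclass, field


@dataclass
class MetricAlertModel:
    """Simplified model: notify only when the state changes."""
    threshold: float
    fired: bool = False
    notifications: list = field(default_factory=list)

    def evaluate(self, value: float) -> None:
        breached = value > self.threshold
        if breached and not self.fired:
            self.fired = True
            self.notifications.append("Fired")
        elif not breached and self.fired:
            self.fired = False
            self.notifications.append("Resolved")


alert = MetricAlertModel(threshold=90.0)
for dtu_percent in [50, 95, 97, 40]:
    alert.evaluate(dtu_percent)
print(alert.notifications)
```

Note that the second breach (97) produces no extra notification, because the alert is already in the Fired state.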
Enable custom processing based on Azure resource metric or log data
- Create Event Hub and Azure Functions resources
- Azure Function listens for incoming data from the Event Hub
- Deploy custom processing logic to Azure Functions
- Set Diagnostic settings in Azure resources to send data to your Event Hub
Here is an example:
- Requires custom development
- Simplified example of an Event Hub forwarder: src/EventHubListener/EventHubForwarderFunction.cs
- Full flexibility and control
- Diagnostic settings can be managed at scale using Azure Policies
- You can use Scenario 1 for a large-scale monitoring solution and extend it with this more custom solution for selected events only, to optimize certain automation scenarios
- You can have up to 5 diagnostic settings applied to an Azure resource
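As a rough idea of what the custom processing logic might do, here is a Python sketch that picks selected records out of one Event Hub message body. It is not the EventHubForwarderFunction.cs implementation referenced above. It assumes that the diagnostic data arrives as JSON with a top-level "records" array and that each record carries a "category" field. The function name and the sample payload are hypothetical.

```python
import json


def select_records(event_body: str, wanted_categories: set) -> list:
    """Return the records whose category is in wanted_categories."""
    payload = json.loads(event_body)
    # Assumption: records are wrapped in a top-level "records" array.
    return [
        record
        for record in payload.get("records", [])
        if record.get("category") in wanted_categories
    ]


# Hypothetical message body with two records.
body = json.dumps({
    "records": [
        {"category": "AppExceptions", "message": "ContosoRetailBackendException"},
        {"category": "AppTraces", "message": "Started"},
    ]
})
for record in select_records(body, {"AppExceptions"}):
    print(record["message"])
```

Filtering early like this keeps the custom path limited to the selected events, while everything else continues to flow to the workspace as in Scenario 1.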
Minimize latency from event to action
- Create Event Hub and Azure Functions resources
- Azure Function listens for incoming data from the Event Hub
- Deploy custom processing logic to Azure Functions
- Use the custom endpoint directly from your applications
Here is an example:
- Heavy on custom development
- Very low latency
- Makes sense if action is automated
- E.g. Call API when certain event or metric threshold is met
- Hard to justify, if action causes humans to do corrective actions
- You need to create reusable code to do this in multiple applications
- E.g. NuGet package for your .NET apps
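The "reusable code" note could look something like the following Python sketch: a small helper that every application shares, with the actual transport injected from outside. The class name, the event fields and the in-memory transport are all hypothetical. In a real application the injected function would send the bytes to your custom endpoint.

```python
import json
from typing import Callable


class EventClient:
    """Hypothetical shared helper; `send` is the transport supplied by the app."""

    def __init__(self, app_name: str, send: Callable[[bytes], None]) -> None:
        self._app_name = app_name
        self._send = send

    def track(self, name: str, value: float) -> None:
        # Build one event and hand it to the injected transport.
        event = {"app": self._app_name, "name": name, "value": value}
        self._send(json.dumps(event).encode("utf-8"))


# Usage with an in-memory list standing in for the real endpoint.
sent = []
client = EventClient("webshop", sent.append)
client.track("queue-length", 120)
print(sent[0].decode("utf-8"))
```

Injecting the transport keeps the helper testable without any Azure resources and lets each application decide how it authenticates and sends.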
Azure Master Class Part 9 - Monitoring and Security
End-to-end correlation across Logic Apps
Logic Apps and 'x-ms-client-tracking-id'
Log data ingestion time in Azure Monitor
Alert triggered by partial data
You can have up to 5 diagnostic settings applied to an Azure resource.
If you're configuring diagnostic settings for your resource, you might get the following error:
Failed to update diagnostics for 'monitoringdemo'.
{
"code":"Conflict",
"message": "Data sink '/subscriptions/<id>/resourceGroups/<rg>/providers/Microsoft.EventHub/namespaces/<ns>/authorizationrules/RootManageSharedAccessKey'
is already used in diagnostic setting 'monitoring' for category 'AppExceptions'.
Data sinks can't be reused in different settings on the same category for the same resource."
}.
It means that you cannot create multiple diagnostic settings that send the same category to the same destination.
In the Event Hub scenario, the destination includes the authorizationrules/<your access key> part.
The following is not allowed:
- AppEvents and AppExceptions to Event Hub namespace ns and event hub eh1 using RootManageSharedAccessKey
- AppDependencies and AppExceptions to Event Hub namespace ns and event hub eh2 using RootManageSharedAccessKey
The following is allowed:
- AppEvents and AppExceptions to Event Hub namespace ns and event hub eh1 using eh1Policy
- AppDependencies and AppExceptions to Event Hub namespace ns and event hub eh2 using eh2Policy
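The rule behind the allowed and not allowed combinations can be checked before deployment with a few lines of code. The Python sketch below models only the rule stated in the error message (the same category must not go to the same data sink from two settings on one resource). The function name, the dictionary layout and the shortened sink identifiers are hypothetical.

```python
from collections import defaultdict


def find_conflicts(settings: list) -> list:
    """Return (category, sink, setting names) for every sink reused on the same category."""
    used = defaultdict(list)
    for setting in settings:
        for category in setting["categories"]:
            used[(category, setting["sink"])].append(setting["name"])
    return [
        (category, sink, names)
        for (category, sink), names in used.items()
        if len(names) > 1
    ]


# Not allowed: both settings use the same authorization rule for AppExceptions.
not_allowed = [
    {"name": "setting1", "sink": "ns/authorizationrules/RootManageSharedAccessKey",
     "categories": ["AppEvents", "AppExceptions"]},
    {"name": "setting2", "sink": "ns/authorizationrules/RootManageSharedAccessKey",
     "categories": ["AppDependencies", "AppExceptions"]},
]
# Allowed: each setting uses its own policy, so the sinks differ.
allowed = [
    {"name": "setting1", "sink": "ns/eh1/authorizationrules/eh1Policy",
     "categories": ["AppEvents", "AppExceptions"]},
    {"name": "setting2", "sink": "ns/eh2/authorizationrules/eh2Policy",
     "categories": ["AppDependencies", "AppExceptions"]},
]
print(find_conflicts(not_allowed))
print(find_conflicts(allowed))
```

The first call reports a conflict on AppExceptions and the second reports none, which matches the two lists above.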
Read more about correlation in monitoring.