Azure Monitoring examples

🚧 This content is work in progress.

Designing a complete monitoring solution requires that you understand the different scenarios and requirements, so that you can prepare your Azure components to match them. Here are a few thoughts on how to approach that planning.

Planning

A typical monitoring solution is a combination of the scenarios listed below. Therefore, it makes sense to look at them from a scenario point of view.

You should know your requirements because ultimately they impact your monitoring solution.

Example: You're required to store certain application events for 5 years -> You have to think about long-term storage, such as an Azure Storage account, for those events. The maximum data retention of a Log Analytics workspace is 2 years (730 days).

Example: You need to provide a chart of certain Azure resource metric data for the last 8 months -> You have to store this metric data in logs, since metric data is only available for 3 months (93 days).
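The two examples above boil down to comparing a retention requirement with the limits of each store. A minimal sketch of that check follows. The function name and the store labels are hypothetical, and the two limits are the values quoted above, so verify them against current Azure documentation before relying on them.

```python
# Limits quoted in the text above; they may change over time.
METRICS_RETENTION_DAYS = 93
LOG_ANALYTICS_MAX_RETENTION_DAYS = 730


def choose_store(data_kind: str, required_days: int) -> str:
    """Return a (hypothetical) store label that satisfies a retention requirement."""
    if data_kind not in ("metric", "log"):
        raise ValueError(f"unknown data kind: {data_kind}")
    # Platform metrics are enough only while the requirement fits their retention.
    if data_kind == "metric" and required_days <= METRICS_RETENTION_DAYS:
        return "platform-metrics"
    # Otherwise the data has to live in a Log Analytics workspace, if it fits.
    if required_days <= LOG_ANALYTICS_MAX_RETENTION_DAYS:
        return "log-analytics-workspace"
    # Anything longer needs long-term storage such as a Storage account.
    return "storage-account"


print(choose_store("log", 5 * 365))    # application events kept for 5 years
print(choose_store("metric", 8 * 30))  # metric chart covering roughly 8 months
```

The first call returns `storage-account` and the second returns `log-analytics-workspace`, matching the two examples.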

If you have a hard time planning your overall solution from technical components, you can try event modeling to help.

Implementation steps

When planning and implementing your monitoring scenarios you typically follow these steps:

  1. Enable data collection
  2. Find correct data
  3. Create alert from data
  4. Create action from alert
  5. Visualize
  6. Test
  7. Automate

1. Enable data collection

What you can't see, you can't measure. What you can't measure, you can't improve.

Quote from Enterprise-scale architecture operational design principles / Management and monitoring

Based on your monitoring scenario, you might need to enable data collection in a virtual machine (e.g. Windows Performance Counters: Process(*)\% Processor Time for monitoring processor usage per process) or at different Azure resource levels (e.g. push resource metrics to a Log Analytics workspace).

2. Find correct data

Next, verify that you can actually find the correct data. In some scenarios that is as simple as viewing metrics charts; in more advanced scenarios you need to find your data using KQL queries.

Example: Find CPU usage for process CalcService (important background Windows Service):

Perf
| where ObjectName == "Process" and
        CounterName == "% Processor Time" and
        Computer == "vmname" and InstanceName == "CalcService"

3. Create alert from data

When you have found the data that you want to use for monitoring, you can follow these instructions for implementing your alerts: Overview of alerts in Microsoft Azure

Note: You can create a rule that fires when the query finds data and, similarly, a rule that fires when it finds no data.

Example: Query for a running process and trigger an alert if it is not found.
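The "data found" and "no data found" variants differ only in how the row count of one evaluation window is interpreted. The sketch below illustrates that decision in plain Python. It is not how Azure Monitor implements alert evaluation, and the function name, parameter names, and sample rows are all hypothetical.

```python
from typing import Iterable, Mapping


def should_fire(rows: Iterable[Mapping[str, object]], fire_when_empty: bool) -> bool:
    """Decide whether an alert fires for one evaluation window.

    rows            -- records returned by the log query for the window
    fire_when_empty -- True for "alert when no data", False for "alert when data found"
    """
    count = sum(1 for _ in rows)
    return count == 0 if fire_when_empty else count > 0


# Hypothetical query results: CalcService reported samples in the first window only.
window_running = [{"InstanceName": "CalcService", "CounterValue": 1.5}]
window_stopped = []

print(should_fire(window_running, fire_when_empty=True))  # False
print(should_fire(window_stopped, fire_when_empty=True))  # True
```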

4. Create action from alert

Alerts cause actions to trigger, and for that we use action groups.

You should plan your action groups so that you reach the people who can actually do something about a given alert.

Example: Your app relies on a downstream API developed by another team inside your company. If that API starts to fail and your application is impacted, you can create an action group that notifies that API team directly.

5. Visualize

Alerts and notifications are often enough to start the incident and troubleshooting process. Sometimes it helps a lot to have additional dashboards, workbooks, or other visualizations that clarify the underlying conditions.

You can look for examples in the microsoft/AzureMonitorCommunity repository.

6. Test

To make sure that the query and the alert work as intended, you have to test your implementation. In the example above, that means stopping the CalcService Windows Service, which should cause the alert to fire.

7. Automate

To deploy these reliably across environments, you have to automate the deployment of the different components.

Here are a few links for getting started with the automation:

Scenarios

Scenario 1

What

Collect data from Azure resources with minimal effort and get alerted in specific conditions

How

  • Create a Log Analytics workspace for logs
  • Set Diagnostic settings in Azure resources to send data to the Log Analytics workspace
  • Create a log-based query alert in the workspace

Monitoring architecture

A log query can then be used for creating alerts:

Log Analytics log query alert

In the example above, a webhook is called when the alert fires.

Read more about all available actions in action groups.

Here are a few example queries:

Find failed Logic Apps integrations:

AzureDiagnostics 
| where OperationName == "Microsoft.Logic/workflows/workflowRunCompleted"
| where Level == "Error"

Find specific custom exception:

AppExceptions
| where ExceptionType == "ContosoRetailBackendException"

Notes

If you have a single application that already uses Application Insights, you can create a similar query-based alert in that resource:

App Insights Log Alert

Scenario 2

What

Create metric based alerts for Azure resources

How

  • Find Azure resource metric that you want to monitor
  • Create metric based alert to that resource

Here are a few examples:

Failed runs in Logic Apps resource:

App Insights Metric Alert

Exception count in Application Insights resource:

App Insights Metric Alert

DTU (Database transaction unit) usage is high in SQL Database:

App Insights Metric Alert

Notes

  • Alerts are stateful: the platform automatically changes the state from Fired to Resolved when the condition clears
    • You get notified when the state changes to Resolved
  • Only limited filtering is available for metrics (dimensions of metrics)
    • Example: You cannot create an alert for specific exceptions only in App Insights using metric alerts
  • A metric-based alert costs $0.10 per monitored signal per month
  • Use the common alert schema
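The per-signal price in the notes above makes it easy to estimate a monthly bill for a set of metric alert rules. The sketch below does that arithmetic. The function name and the sample rule counts are hypothetical, and the price is the figure quoted above, so check the Azure Monitor pricing page for current values.

```python
from decimal import Decimal

# Price quoted in the notes above; it may differ from current pricing.
PRICE_PER_SIGNAL_PER_MONTH = Decimal("0.10")


def monthly_metric_alert_cost(signals_per_rule: list[int]) -> Decimal:
    """Sum the monthly cost of alert rules, given the monitored signal count of each rule."""
    if any(n < 0 for n in signals_per_rule):
        raise ValueError("signal counts must be non-negative")
    return PRICE_PER_SIGNAL_PER_MONTH * sum(signals_per_rule)


# Hypothetical estate: three rules monitoring 1, 1 and 4 signals.
print(monthly_metric_alert_cost([1, 1, 4]))  # 0.60
```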

Scenario 3

What

Enable custom processing based on Azure resource metric or log data

How

  • Create Event Hub and Azure Functions resources
  • Azure Function listens for incoming data from Event Hub
  • Deploy custom processing logic to Azure Functions
  • Set Diagnostic settings in Azure resources to send data to your Event Hub
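The custom processing logic in the steps above can be sketched independently of the Azure Functions hosting model. The example assumes that each Event Hub message body is a JSON document with a `records` array, which is how diagnostic settings deliver data to Event Hub. The record fields used in the filter (`level`, `category`, `operationName`) and the sample values are assumptions for illustration, and `process_event_body` is a hypothetical name.

```python
import json
from typing import Callable


def process_event_body(body: str, handle: Callable[[dict], None]) -> int:
    """Parse one Event Hub message body and pass matching records to `handle`.

    Returns the number of records handled.
    """
    payload = json.loads(body)
    handled = 0
    for record in payload.get("records", []):
        # Hypothetical filter: only react to error-level records.
        if str(record.get("level", "")).lower() == "error":
            handle(record)
            handled += 1
    return handled


# Hypothetical sample message with two records.
sample = json.dumps({
    "records": [
        {"category": "WorkflowRuntime", "level": "Error",
         "operationName": "Microsoft.Logic/workflows/workflowRunCompleted"},
        {"category": "WorkflowRuntime", "level": "Information",
         "operationName": "Microsoft.Logic/workflows/workflowRunStarted"},
    ]
})

failed = []
print(process_event_body(sample, failed.append))  # 1
```

In a real function, `handle` would be the place to call an API or start an automation instead of appending to a list.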

Here is example:

Diagnostic Settings and Event Hub Custom Forwarder

Notes

  • Requires custom development
  • Full flexibility and control
  • Diagnostic settings can be managed at scale using Azure Policies
  • You can use Scenario 1 for a large-scale monitoring solution and extend it with this more custom solution for selected events only, to optimize certain automation scenarios
    • You can have up to 5 diagnostic settings applied to an Azure resource

Scenario 4

What

Minimize latency from event to action

How

  • Create Event Hub and Azure Functions resources
  • Azure Function listens for incoming data from Event Hub
  • Deploy custom processing logic to Azure Functions
  • Use the custom endpoint directly from your applications

Here is example:

Custom diagnostics with Event Hub Custom Forwarder

Notes

  • Heavy on custom development
  • Very low latency
  • Makes sense if action is automated
    • E.g. Call API when certain event or metric threshold is met
    • Hard to justify if the action requires humans to take corrective action
  • You need to create reusable code to do this in multiple applications
    • E.g. Nuget package for your .NET apps
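The "reusable code" note above can be illustrated with a small helper that applications share. The sketch below is written in Python rather than as a NuGet package, and everything in it is hypothetical: the class name, the envelope fields, and the `send` callable, which stands in for whatever client actually publishes to your Event Hub.

```python
import json
import time
from typing import Callable


class DiagnosticsForwarder:
    """Small reusable helper that formats events and hands them to a sender."""

    def __init__(self, app_name: str, send: Callable[[bytes], None]) -> None:
        self._app_name = app_name
        self._send = send

    def emit(self, event_name: str, **properties: object) -> None:
        """Serialize one event as JSON and send it immediately (no batching)."""
        envelope = {
            "app": self._app_name,
            "event": event_name,
            "timestamp": time.time(),
            "properties": properties,
        }
        self._send(json.dumps(envelope).encode("utf-8"))


# Stand-in sender that collects messages in memory instead of calling Event Hub.
outbox = []
forwarder = DiagnosticsForwarder("orders-api", outbox.append)
forwarder.emit("QueueDepthExceeded", depth=1200)
print(json.loads(outbox[0])["event"])  # QueueDepthExceeded
```

Injecting the sender keeps the helper testable without a live Event Hub and lets each application decide how it connects.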

Additional notes

Blogs, articles and videos on the topic

Azure Master Class Part 9 - Monitoring and Security

End-to-end correlation across Logic Apps

Logic Apps and 'x-ms-client-tracking-id'

Pricing

Azure Monitor Pricing

Azure Pricing Calculator

Data ingestion

Log data ingestion time in Azure Monitor

Alert triggered by partial data

Limits

You can have up to 5 diagnostic settings applied to an Azure resource.

Azure Monitor service limits

Data sink conflict

If you're configuring diagnostic settings for your resource, you might get the following error:

Failed to update diagnostics for 'monitoringdemo'.
{
  "code":"Conflict",
  "message": "Data sink '/subscriptions/<id>/resourceGroups/<rg>/providers/Microsoft.EventHub/namespaces/<ns>/authorizationrules/RootManageSharedAccessKey'
  is already used in diagnostic setting 'monitoring' for category 'AppExceptions'.
  Data sinks can't be reused in different settings on the same category for the same resource."
}.

This means that you cannot create multiple diagnostic settings that send the same category to the same destination. In the Event Hub scenario, the destination is identified by the authorizationrules/<your access key> part.

The following is not allowed:

  • AppEvents and AppExceptions to Event Hub namespace ns and event hub eh1 using RootManageSharedAccessKey
  • AppDependencies and AppExceptions to Event Hub namespace ns and event hub eh2 using RootManageSharedAccessKey

The following is allowed:

  • AppEvents and AppExceptions to Event Hub namespace ns and event hub eh1 using eh1Policy
  • AppDependencies and AppExceptions to Event Hub namespace ns and event hub eh2 using eh2Policy
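The rule described above can be checked before deployment with a small validation step. The sketch below models the rule as this section describes it: a sink is identified by the namespace and authorization rule, not by the event hub name. The type and function names are hypothetical, and the model is an approximation of the platform's behavior rather than a definitive statement of it.

```python
from collections import defaultdict
from typing import NamedTuple


class DiagnosticSetting(NamedTuple):
    name: str
    categories: frozenset
    namespace: str
    event_hub: str
    policy: str  # authorization rule name


def find_sink_conflicts(settings):
    """Return (category, sink, setting names) for each category that reuses a sink."""
    users = defaultdict(list)
    for s in settings:
        # Assumption from the text above: the event hub name is not part of the sink identity.
        sink = (s.namespace, s.policy)
        for category in sorted(s.categories):
            users[(category, sink)].append(s.name)
    return [(cat, sink, names) for (cat, sink), names in users.items() if len(names) > 1]


not_allowed = [
    DiagnosticSetting("a", frozenset({"AppEvents", "AppExceptions"}),
                      "ns", "eh1", "RootManageSharedAccessKey"),
    DiagnosticSetting("b", frozenset({"AppDependencies", "AppExceptions"}),
                      "ns", "eh2", "RootManageSharedAccessKey"),
]
allowed = [
    DiagnosticSetting("a", frozenset({"AppEvents", "AppExceptions"}), "ns", "eh1", "eh1Policy"),
    DiagnosticSetting("b", frozenset({"AppDependencies", "AppExceptions"}), "ns", "eh2", "eh2Policy"),
]

print(find_sink_conflicts(not_allowed))  # one conflict, on AppExceptions
print(find_sink_conflicts(allowed))      # []
```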

Correlation

Read more about correlation in monitoring.