/aks-logging-cost-optimisations

Bundling up solutions for cost optimisation automation for AKS logging

AKS Logging Optimisations

Overview

Content to assist customers in optimizing logging for workloads that run on Azure Kubernetes Service.

Data Collection Rule

Log Analytics workspaces support ingestion-time transformations for specific tables. Luckily, the ContainerLog and ContainerLogV2 tables are both supported. This means you can reduce the logging cost of your AKS clusters by filtering out unneeded logs before they are ingested. It also lets you drop any logs containing Personally Identifiable Information (PII) before they reach the workspace.

Guide

Investigate the logs being generated by pods in your cluster and create a list of messages that you'd like to exclude from ingestion.

Some suggestions are:

  • kube-probe: Pods typically have readiness and liveness probes configured, usually running every 10 seconds or so, and the log messages these checks generate can be noisy.
  • /health: Similar to kube-probe above, this applies if the pod logs its health check path or a success message on every check.

By removing even just the two examples above you may be able to reduce your log ingestion dramatically. It all depends on what your workloads are logging.

A Data Collection Rule (DCR) transformation that excludes the above would look like:

source | where LogMessage !has "health" | where LogMessage !has "kube-probe"

This repository contains Bicep modules to deploy the above DCR filters on a Log Analytics workspace. Update the parameters in the dcr.bicepparam file before deploying. The code is provided as an example and a quick start only; use it at your own discretion.

[Image: dcr]

To make the most of the DCR, review your pod logs and identify any noisy or unhelpful log messages. Ideally, applications should only log useful information. To get started reviewing your pod logs quickly, import my AKS Workbook and navigate to the logs table.

[Image: example-pod-logs-audit]

Note: Always test the query you're adding to the DCR before implementing it. A filter that uses the KQL has operator may exclude more logs than intended.
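One way to sanity-check candidate exclusion terms before touching the DCR is to run them over a handful of sample log messages. The Python sketch below is only a rough approximation of the KQL has operator (a case-insensitive match on whole alphanumeric terms); it is not the Log Analytics engine, so the real query should still be tested against the workspace. The function names and sample messages are hypothetical.

```python
import re

# Terms taken from the example transformation above.
EXCLUDED_TERMS = ["health", "kube-probe"]


def has_term(message, term):
    """Approximate KQL `has`: case-insensitive match on a whole term.

    Assumption: a term boundary is any character that is not an ASCII
    letter or digit. The real operator may differ in edge cases.
    """
    pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    return re.search(pattern, message, flags=re.IGNORECASE) is not None


def is_kept(message, excluded_terms=EXCLUDED_TERMS):
    """Return True if the message would survive the exclusion filter."""
    return not any(has_term(message, term) for term in excluded_terms)


# Hypothetical sample messages, for illustration only.
SAMPLES = [
    'GET /health HTTP/1.1 200 "kube-probe/1.27"',
    "order 1234 created",
    "patient health record export failed",
    "dependency unhealthy: redis",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print("keep" if is_kept(sample) else "drop", "|", sample)
```

In this sketch the third sample is dropped even though it is probably a message worth keeping, which is the kind of over-exclusion the note above warns about. A narrower term, such as the full health check path, would reduce that risk.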

Note: DCRs can take about 30 minutes to take effect after initial creation.

Note: There may already be a DCR on your cluster, for example one created when Container Insights was enabled. Check for existing rules before deploying another.

Documentation

Azure Diagnostic Logs

The azureDiagnostics.json file contains an Azure Workbook that audits the Azure diagnostic settings being ingested into Log Analytics workspaces. It visualises the data and provides data-driven insights into your environment.

[Image: diagnostics]

For AKS in particular, it displays whether the kube-audit and kube-audit-admin logs are both being captured for a cluster. The kube-audit-admin log is a subset of kube-audit. Customers should enable one or the other, not both.

[Image: kube-audit]

Container Insights

Workload Logs

When a container writes a message to stdout or stderr, it is picked up and captured by Container Insights (by default, when enabled on a cluster).

This can cause a multiline log message to be ingested as several separate records. A crude example, to visualize the issue:

[Image: badlog]

The above example shows a container logging a non-stringified payload to stdout. The payload contains a stack trace (error dump). Because the payload spans multiple lines, it is ingested as multiple rows. This:

  1. Increases cost, as each row carries its own metadata.
  2. Makes troubleshooting harder, as the data isn't grouped correctly.

This log message should instead be written to stdout or stderr as a stringified JSON object. Container Insights will then pick up and store the log as a single record, like:

[Image: correct-log]
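As a minimal sketch of the stringified approach, assuming a Python workload, the following writes an exception and its stack trace to stdout as one JSON line. The field names (timestamp, level, message, stacktrace) and the helper names are illustrative assumptions, not a schema that Container Insights requires.

```python
import json
import sys
import traceback
from datetime import datetime, timezone


def format_log(level, message, exc=None):
    """Return a single JSON-encoded line describing one log event."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    if exc is not None:
        # json.dumps escapes the newlines inside the stack trace, so the
        # whole record stays on one physical line.
        record["stacktrace"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return json.dumps(record)


def log(level, message, exc=None):
    """Write the event to stdout as exactly one line."""
    sys.stdout.write(format_log(level, message, exc) + "\n")
    sys.stdout.flush()


if __name__ == "__main__":
    try:
        1 / 0
    except ZeroDivisionError as err:
        log("error", "division failed", err)
```

Because the stack trace is embedded as an escaped string inside the JSON object, the container emits one line per event rather than one line per stack frame.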

Note: The above is just an example of a structured log message. For guidance on application logging review the following guide.

ContainerLogV2

Several new features have been released for Container Insights that aim to reduce logging costs. They are available when you upgrade to the new ContainerLogV2 format. A few useful ones are:

  • Multiline Log Parsing
  • Reducing Ingestion Frequency
  • Basic logs

Configuration Options

For more information review the cost optimisation document.

Always review the Container Insights collection ConfigMap and disable any settings you don't need.

  • For example, you may have deployed Prometheus into your AKS cluster while the Azure Monitor Agent (AMA) is still scraping pods that carry Prometheus annotations.