Primary language: HCL. License: Apache License 2.0 (Apache-2.0).

EKG (Essential Kubernetes Gauges)

Monitor the health of a Kubernetes cluster with ease, using Service Level Objectives (SLOs).

Essential Kubernetes Gauges (EKG) provides a set of standardized, prefabricated SLOs that measure the reliability of a Kubernetes cluster. You can think of these SLOs as a check engine light that tells you when your EKS cluster is misbehaving, with a historical record of when the cluster was behaving as desired and when it was not. SLOs allow you to set adjustable goals for the reliability aspects of your clusters. EKG includes SLOs that measure several aspects of a cluster:

  • Control Plane Health
    • Is the Kubernetes API responding normally?
    • Is it performant?
  • Cluster Health
    • Are the nodes healthy?
    • Is there some minimum of resource headroom?
    • Can we start new workloads?
  • Workload Health
    • Is anything in a bad state?
    • Is there at least some kind of workload running?
  • Resource Efficiency
    • Are resources underutilized in this cluster?
    • Is the cluster scaling in such a way that it is making good use of resources, without endangering workloads?
  • Cost Efficiency measurements have been proposed as a future enhancement and are under consideration.

These aspects of cluster behavior provide a gauge on how well Kubernetes (as an application) is running, as well as how the overall cluster is faring, along with some insight into the health of the workloads the cluster is supporting. By running EKG's SLOs, you can measure how well your clusters are doing over time, share numbers, charts, and reports on cluster reliability across your teams, managers, and stakeholders, and use all the features of Nobl9, including alert policies and alert integrations with a wide variety of popular tools.

For more information on the specific SLOs included in EKG and how to make use of them, please see the SLO docs.

Supported Products

While EKG's SLOs and techniques are usable with any flavor of Kubernetes, and while the underlying instrumentation frameworks are compatible with many distributions, we currently provide end-to-end automation for the following products:

  • Amazon EKS (Amazon Elastic Kubernetes Service)

Support for additional Kubernetes distributions has been proposed and is under consideration.

Architecture

[Diagram: deployment of EKG with default settings on an existing EKS cluster]

An Amazon Managed Service for Prometheus workspace is (optionally) created, and metrics are collected from the following electrodes:

  • kube-state-metrics - metrics for the health of the various objects inside the cluster, such as deployments, nodes, and pods
  • node-problem-detector - metrics for the health of the nodes, e.g. infrastructure daemon issues (NTP service down), hardware issues (bad CPU, memory, or disk), kernel issues (kernel deadlock, corrupted file system), and container runtime issues (unresponsive runtime daemon)
  • Kuberhealthy - metrics for the health of the cluster; it runs synthetic tests that ensure daemonsets and deployments can be deployed, DNS resolves names, etc.

Metrics are plumbed into Amazon Managed Service for Prometheus using AWS Distro for OpenTelemetry (ADOT). ADOT also collects metrics exposed by other services, so it can be the only collector you need for all of the platform and application metrics in a cluster.

The repository gives fine-grained control over what is deployed or reused. For instance, you can install only chosen electrodes or all of them, reuse an existing Amazon Managed Service for Prometheus workspace, or install only ADOT. For details, check the documentation of the specific modules. Electrodes can be used with any Kubernetes cluster: EKS, GKE, on-premises, etc.
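
As one illustration of this control, the root module's managed_prometheus_workspace_id input (listed under Inputs) lets you reuse an existing workspace instead of creating a new one. A minimal terraform.tfvars sketch follows; the workspace ID shown is a placeholder assumption, not a real workspace:

```hcl
# terraform.tfvars (fragment)
# Placeholder ID (assumption): replace with the ID of your existing
# Amazon Managed Service for Prometheus workspace.
# Leaving this variable at its default ("") makes EKG create a new workspace.
managed_prometheus_workspace_id = "ws-abcdef12-3456-7890-abcd-ef1234567890"
```

Options for individual electrodes and ADOT live in the module-level documentation, so they are not shown here.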

To learn how to contribute please read the contribution guidelines.

End-to-end installation of EKG on a fresh EKS cluster

  1. Prerequisites. You will need:

    • A Nobl9 Organization. If you don't already have a Nobl9 org, you can sign up for Nobl9 free edition at https://app.nobl9.com/signup/
    • Terraform. If you need to install it: https://developer.hashicorp.com/terraform/downloads
    • An AWS account, with configuration and credentials connecting it to Terraform, for example by installing AWS CLI
    • An EKS cluster. We assume you have a bunch of these, but if you want to spin up a fresh test cluster to try out EKG in isolation, how about following the steps in this tutorial? The tutorial defaults to Terraform Cloud (which is quite nice) but for this exercise we recommend you click on the Terraform OSS tabs as you proceed.
    • An IAM OIDC provider configured for the EKS cluster. The tutorial linked above does this for you, but if you are using an existing cluster, you may need to do it manually.
  2. Create a terraform.tfvars file. A starting point can be found in terraform.tfvars.example:

    cp terraform.tfvars.example terraform.tfvars
    # edit that file with an editor of your choice
    # provide values for your AWS region, cluster name, and Nobl9 organization
  3. Provide required secrets to the Nobl9 Terraform Provider. In the Nobl9 web UI, go to Settings > Access Keys, create an access key (save it somewhere) and then set the values as env vars:

    export TF_VAR_nobl9_client_id="<your Nobl9 Client ID>"
    export TF_VAR_nobl9_client_secret="<your Nobl9 Client Secret>"
  4. Use Terraform to install the EKG components. In the root of this repository, run:

    terraform init
    terraform apply

The output of the above should look similar to:

Apply complete! Resources: 36 added, 0 changed, 0 destroyed.

Outputs:

amp_ws_endpoint = "https://aps-workspaces.us-east-2.amazonaws.com/workspaces/ws-abcdef12-3456-7890-abcd-ef1234567890/"

Congratulations! EKG is up and running. Go back to the Nobl9 web UI and explore your newly created SLOs. For more information about these SLOs, how to use them, and how to tune them to your clusters' conditions and workloads, see the SLO docs.

What all does this install?

Inside your EKS cluster

  • kube-state-metrics
  • Kuberhealthy
  • node-problem-detector
  • AWS Distro for OpenTelemetry (ADOT)
  • A Nobl9 Agent compatible with and connecting to Amazon Managed Service for Prometheus

Inside your Nobl9 Organization

  • A Nobl9 Project and a Service to hold the SLOs

  • The EKG SLOs (several prefab SLOs for Kubernetes)

  • An agent-based Data Source configured to receive data from the Nobl9 Agent running in the EKS cluster

Elsewhere inside your AWS Account

  • An IAM User with access keys (configured in the Nobl9 Agent) and an inline policy allowing it to access the Amazon Managed Service for Prometheus workspace

  • An Amazon Managed Service for Prometheus workspace, available at the URL output as amp_ws_endpoint

The amp_ws_endpoint is a URL for Amazon Managed Service for Prometheus that can be used directly, for instance in Grafana or Nobl9. Deploying Grafana or other visualization tools is not in the scope of this project, but if you are looking for a quick and clean Grafana to play with as you explore the metrics in your newly created Prometheus, how about running it locally from a Docker image and including the additional env vars required to allow it to connect to Amazon's managed Prometheus?

Or in short:

docker run -d -p 3000:3000 --name="grafana" -e "AWS_SDK_LOAD_CONFIG=true" -e "GF_AUTH_SIGV4_AUTH_ENABLED=true" grafana/grafana-oss

Then, when you configure a Prometheus data source, it will offer AWS-specific settings to connect it to that amp_ws_endpoint.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.1.0 |
| aws | >= 3.72 |
| helm | >= 2.4.1 |
| kubernetes | >= 2.10 |
| nobl9 | 0.26.0 |
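
If you embed EKG in a larger Terraform configuration, the constraints above can be expressed roughly as follows. This is a sketch, not the repository's own versions file: the provider source addresses are assumed to be the standard Terraform Registry namespaces and are not stated in this README.

```hcl
terraform {
  # Version constraints copied from the Requirements table.
  required_version = ">= 1.1.0"

  required_providers {
    # Source addresses are assumptions (usual Terraform Registry namespaces).
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.72"
    }
    helm = {
      source  = "hashicorp/helm"
      version = ">= 2.4.1"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.10"
    }
    nobl9 = {
      source  = "nobl9/nobl9"
      version = "0.26.0" # pinned to an exact version in the table
    }
  }
}
```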

Providers

| Name | Version |
|------|---------|
| aws | 5.53.0 |
| kubernetes | 2.30.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| adot_amp | ./modules/adot-amp | n/a |
| electrodes | ./modules/electrodes | n/a |
| nobl9 | ./modules/nobl9 | n/a |

Resources

| Name | Type |
|------|------|
| kubernetes_namespace.this | resource |
| aws_eks_cluster.this | data source |
| aws_eks_cluster_auth.this | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| api_error_rate_target | SLO reliability target for apiserver error rate | number | n/a | yes |
| api_latency_target | SLO reliability target for apiserver latency | number | n/a | yes |
| aws_region | AWS Region | string | n/a | yes |
| create_namespace | Whether to create the namespace specified by the variable namespace if it does not exist | bool | true | no |
| eks_cluster_id | Name of the EKS cluster | string | n/a | yes |
| enable_kube_state_metrics_slos | Would you like to include SLOs that rely on kube-state-metrics? | bool | n/a | yes |
| enable_kuberhealthy_slos | Would you like to include SLOs that rely on kuberhealthy metrics? | bool | n/a | yes |
| enable_node_problem_detector_slos | Would you like to include SLOs that rely on node-problem-detector? | bool | n/a | yes |
| kuberhealthy_pod_start_latency | Allowable duration of kuberhealthy check | number | n/a | yes |
| kuberhealthy_pod_start_latency_target | SLO reliability target for kuberhealthy pod start latency | number | n/a | yes |
| kuberhealthy_pod_start_success_target | SLO reliability target for kuberhealthy pod start success | number | n/a | yes |
| managed_prometheus_workspace_id | Amazon Managed Service for Prometheus Workspace ID (when nothing is passed, a new workspace will be created) | string | "" | no |
| memory_headroom_target | SLO reliability target for memory headroom | number | n/a | yes |
| memory_headroom_threshold | Maximum allowable memory consumption (fraction of 1.0) | number | n/a | yes |
| memory_utilization_target | SLO reliability target for memory utilization | number | n/a | yes |
| memory_utilization_threshold | Minimum allowable memory consumption (fraction of 1.0) | number | n/a | yes |
| namespace | Namespace where electrodes will be installed | string | "ekg-monitoring" | no |
| nobl9_client_id | Nobl9 Client ID (create in Nobl9 web app using Settings > Access Keys) | string | n/a | yes |
| nobl9_client_secret | Nobl9 Client Secret (create in Nobl9 web app using Settings > Access Keys) | string | n/a | yes |
| nobl9_organization_id | Nobl9 Organization ID (visible in Nobl9 web app under Settings > Account) | string | n/a | yes |
| nobl9_project_name | Nobl9 Project name (create one in the Nobl9 web app using Catalog > Projects, or use 'default') | string | "default" | no |
| node_not_ready_target | SLO reliability target for node not ready | number | n/a | yes |
| node_not_ready_threshold | Maximum allowable number of nodes in a non-ready state | number | n/a | yes |
| node_problem_target | SLO reliability target for node problems | number | n/a | yes |
| node_problem_threshold | Number of node problems that is considered a problem for the cluster | number | n/a | yes |
| phase_failed_unknown_target | SLO reliability target for not having pods in a bad state (failed or unknown) | number | n/a | yes |
| phase_failed_unknown_threshold | This many pods in a bad state (failed or unknown) is considered a failing workload | number | n/a | yes |
| phase_pending_target | SLO reliability target for not having pods in a pending state | number | n/a | yes |
| phase_pending_threshold | Maximum allowable number of pods in a pending state | number | n/a | yes |
| phase_running_target | SLO reliability target for having enough things running in the cluster | number | n/a | yes |
| phase_running_threshold | Minimum allowable ratio of pods that should be running | number | n/a | yes |
| rolling_window_days | Evaluation time period for EKG SLOs | number | n/a | yes |
| tags | Additional tags (e.g. map('BusinessUnit', 'XYZ')) | map(string) | {} | no |
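
To make the required inputs more concrete, here is a rough terraform.tfvars sketch covering them. Every value is an illustrative assumption rather than a recommended default: the region, cluster name, and organization ID are placeholders, the targets are assumed to be fractions of 1.0, and the unit of kuberhealthy_pod_start_latency is not stated above. Treat terraform.tfvars.example in the repository as the authoritative starting point.

```hcl
# terraform.tfvars (sketch; all values are assumptions for illustration)

# Where to deploy (placeholders)
aws_region            = "us-east-2"
eks_cluster_id        = "my-eks-cluster"
nobl9_organization_id = "my-nobl9-org"

# Which groups of SLOs to create
enable_kube_state_metrics_slos    = true
enable_kuberhealthy_slos          = true
enable_node_problem_detector_slos = true

# Evaluation window for all EKG SLOs, in days
rolling_window_days = 28

# Control plane (targets assumed to be fractions of 1.0)
api_error_rate_target = 0.99
api_latency_target    = 0.99

# Kuberhealthy synthetic checks
kuberhealthy_pod_start_latency        = 30 # unit not stated in the table; placeholder
kuberhealthy_pod_start_latency_target = 0.99
kuberhealthy_pod_start_success_target = 0.99

# Memory headroom and utilization (thresholds are fractions of 1.0)
memory_headroom_target       = 0.99
memory_headroom_threshold    = 0.85
memory_utilization_target    = 0.99
memory_utilization_threshold = 0.20

# Node health
node_not_ready_target    = 0.99
node_not_ready_threshold = 1
node_problem_target      = 0.99
node_problem_threshold   = 1

# Workload health
phase_failed_unknown_target    = 0.99
phase_failed_unknown_threshold = 1
phase_pending_target           = 0.99
phase_pending_threshold        = 5
phase_running_target           = 0.99
phase_running_threshold        = 0.90
```

The Nobl9 client ID and secret are left out on purpose, since the installation steps supply them through TF_VAR_ environment variables instead of the tfvars file.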

Outputs

| Name | Description |
|------|-------------|
| amp_ws_endpoint | Amazon Managed Prometheus endpoint |