/active-monitor

Provides deep monitoring and self-healing of Kubernetes clusters

Primary LanguageGoApache License 2.0Apache-2.0

Active-Monitor

Maintenance PR slack

Go Report Card Build Status Code Coverage Latest Version

Motivation

Active-Monitor is a Kubernetes custom resource controller which enables deep cluster monitoring and self-healing using Argo workflows.

While it is not too difficult to know that all entities in a cluster are running individually, it is often quite challenging to know that they can all coordinate with each other as required for successful cluster operation (network connectivity, volume access, etc).

Overview

Active-Monitor will create a new health namespace when installed in the cluster. Users can then create and submit HealthCheck object to the Kubernetes server. A HealthCheck / Remedy is essentially an instrumented wrapper around an Argo workflow.

The HealthCheck workflow is run periodically, as defined by repeatAfterSec or a schedule: cron property in its spec, and watched by the Active-Monitor controller.

Active-Monitor sets the status of the HealthCheck CR to indicate whether the monitoring check succeeded or failed. If in case the monitoring check failed then the Remedy workflow will execute to fix the issue. Status of Remedy will be updated in the CR. External systems can query these CRs and take appropriate action if they failed.

RemedyRunsLimit parameter allows to configure how many times a remedy should be run. If Remedy action fails for any reason it will stop on further retries. It is an optional parameter. If it is not set Remedyworkflow is triggered whenever HealthCheck workflow fails.

RemedyResetInterval parameter allows resetting remedy after the reset interval time and RemedyWorkflow can be retried again in case monitor workflow fails. If remedy reaches a RemedyRunsLimit it will be reset when HealthCheck passes in any subsequent run before RemedyResetInterval.

Typical examples of such workflows include tests for basic Kubernetes object creation/deletion, tests for cluster-wide services such as policy engines checks, authentication and authorization checks, etc.

The sort of HealthChecks one could run with Active-Monitor are:

  • verify namespace and deployment creation
  • verify AWS resources are using < 80% of their instance limits
  • verify kube-dns by running DNS lookups on the network
  • verify kube-dns by running DNS lookups on localhost
  • verify KIAM agent by running aws sts get-caller-identity on all available nodes
  • verify if pod max threads has reached
  • verify if storage volume for a pod (e.g: prometheus) has reached its capacity.
  • verify if critical pods e.g: calico, kube-dns/core-dns pods are in a failed or crashloopbackoff state

With the Cluster/Namespace level, healthchecks can be run in any namespace provided namespace is already created. The level in the HealthCheck spec defines at which level it runs; it can be either Namespace or Cluster.

When level is set to Namespace, Active-Monitor will create a ServiceAccount in the namespace as defined in the workflow spec, it will also create the Role and RoleBinding with namespace level permissions so that the HealthChecks in a namespace can be performed.

When the level is set to be Cluster the Active-Monitor will create a ServiceAccount in the namespace as defined in the workflow spec, it will also create the ClusterRole and ClusterRoleBinding with cluster level permissions so that the HealthChecks in a cluster scope can be performed.

Dependencies

Installation Guide

# step 0: ensure that all dependencies listed above are installed or present

# step 1: install argo workflow controller
kubectl apply -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/deploy/deploy-argo.yaml

# step 2: install active-monitor CRD and start controller
kubectl apply -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/config/crd/bases/activemonitor.keikoproj.io_healthchecks.yaml
kubectl apply -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/deploy/deploy-active-monitor.yaml

Alternate Install - using locally cloned code

# step 0: ensure that all dependencies listed above are installed or present

# step 1: install argo workflow-controller
kubectl apply -f deploy/deploy-argo.yaml

# step 2: install active-monitor controller
make install
kubectl apply -f deploy/deploy-active-monitor.yaml

# step 3: run the controller via Makefile target
make run

Usage and Examples

Create a new healthcheck:

Example 1:

Create a new healthcheck with cluster level bindings to specified serviceaccount and in health namespace:

kubectl create -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/examples/inlineHello.yaml

OR with local source code:

kubectl create -f examples/inlineHello.yaml

Then, list all healthchecks:

kubectl get healthcheck -n health OR kubectl get hc -n health

NAME                 LATEST STATUS   SUCCESS CNT     FAIL CNT    AGE
inline-hello-7nmzk   Succeeded        7               0          7m53s

View additional details/status of a healthcheck:

kubectl describe healthcheck inline-hello-zz5vm -n health

...
Status:
  Failed Count:              0
  Finished At:               2019-08-09T22:50:57Z
  Last Successful Workflow:  inline-hello-4mwxf
  Status:                    Succeeded
  Success Count:             13
Events:                      <none>

Example 2:

Create a new healthcheck with namespace level bindings to specified serviceaccount and in a specified namespace:

kubectl create ns test

kubectl create -f https://raw.githubusercontent.com/keikoproj/active-monitor/master/examples/inlineHello_ns.yaml

OR with local source code:

kubectl create -f examples/inlineHello_ns.yaml

Then, list all healthchecks:

kubectl get healthcheck -n test OR kubectl get hc -n test

NAME                 LATEST STATUS   SUCCESS CNT     FAIL CNT    AGE
inline-hello-zz5vm  Succeeded         7               0          7m53s

View additional details/status of a healthcheck:

kubectl describe healthcheck inline-hello-zz5vm -n test

...
Status:
  Failed Count:              0
  Finished At:               2019-08-09T22:50:57Z
  Last Successful Workflow:  inline-hello-4mwxf
  Status:                    Succeeded
  Success Count:             13
Events:                      <none>

argo list -n test

NAME                 STATUS      AGE   DURATION   PRIORITY
inline-hello-88rh2   Succeeded   29s   7s         0
inline-hello-xpsf5   Succeeded   1m    8s         0
inline-hello-z8llk   Succeeded   2m    7s         0

Generates Resources

  • activemonitor.keikoproj.io/v1alpha1/HealthCheck
  • argoproj.io/v1alpha1/Workflow

Sample HealthCheck CR:

apiVersion: activemonitor.keikoproj.io/v1alpha1
kind: HealthCheck
metadata:
  generateName: dns-healthcheck-
  namespace: health
spec:
  repeatAfterSec: 60
  description: "Monitor pod dns connections"
  workflow:
    generateName: dns-workflow-
    resource:
      namespace: health
      serviceAccount: activemonitor-controller-sa
      source:
        inline: |
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            spec:
              ttlSecondsAfterFinished: 60
              entrypoint: start
              templates:
              - name: start
                retryStrategy:
                  limit: 3
                container: 
                  image: tutum/dnsutils
                  command: [sh, -c]
                  args: ["nslookup www.google.com"]

Sample RemedyWorkflow CR:

apiVersion: activemonitor.keikoproj.io/v1alpha1
kind: HealthCheck
metadata:
  generateName: fail-healthcheck-
  namespace: health
spec:
  repeatAfterSec: 60 # duration in seconds
  level: cluster
  workflow:
    generateName: fail-workflow-
    resource:
      namespace: health # workflow will be submitted in this ns
      serviceAccount: activemonitor-healthcheck-sa # workflow will be submitted using this
      source:
        inline: |
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            metadata:
              labels:
                workflows.argoproj.io/controller-instanceid: activemonitor-workflows
            spec:
              ttlSecondsAfterFinished: 60
              entrypoint: start
              templates:
              - name: start
                retryStrategy:
                  limit: 1
                container: 
                  image: ravihari/ctrmemory:v2
                  command: ["python"]
                  args: ["promanalysis.py", "http://prometheus.system.svc.cluster.local:9090", "health", "memory-demo", "memory-demo-ctr", "95"]
  remedyworkflow:
    generateName: remedy-test-
    resource:
      namespace: health # workflow will be submitted in this ns
      serviceAccount: activemonitor-remedy-sa # workflow will be submitted using this acct
      source:
        inline: |
          apiVersion: argoproj.io/v1alpha1
          kind: Workflow
          spec:
            ttlSecondsAfterFinished: 60
            entrypoint: kubectl
            templates:
              -
                container:
                  args: ["kubectl delete po/memory-demo"]
                  command: ["/bin/bash", "-c"]
                  image: "ravihari/kubectl:v1"
                name: kubectl

Active-Monitor Architecture

Access Workflows on Argo UI

kubectl -n health port-forward deployment/argo-ui 8001:8001

Then visit: http://127.0.0.1:8001

Prometheus Metrics

Active-Monitor controller also exports metrics in Prometheus format which can be further used for notifications and alerting.

Prometheus metrics are available on :8080/metrics

kubectl -n health port-forward deployment/activemonitor-controller 8080:8080

Then visit: http://localhost:8080/metrics

Active-Monitor, by default, exports following Promethus metrics:

  • healthcheck_success_count - The total number of successful healthcheck resources
  • healthcheck_error_count - The total number of erred healthcheck resources
  • healthcheck_runtime_seconds - Time taken for the healthcheck's workflow to complete

Active-Monitor also supports custom metrics. For this to work, your workflow should export a global parameter. The parameter will be programmatically available in the completed workflow object under: workflow.status.outputs.parameters.

The global output parameters should look like below:

"{\"metrics\":
  [
    {\"name\": \"custom_total\", \"value\": 123, \"metrictype\": \"gauge\", \"help\": \"custom total\"},
    {\"name\": \"custom_metric\", \"value\": 12.3, \"metrictype\": \"gauge\", \"help\": \"custom metric\"}
  ]
}"

❤ Contributing ❤

Please see CONTRIBUTING.md.

To add a new example of a healthcheck and/or workflow:

Release Process

Please see RELEASE.

License

The Apache 2 license is used in this project. Details can be found in the LICENSE file.

Other Keiko Projects

Instance Manager - Kube Forensics - Addon Manager - Upgrade Manager - Minion Manager - Governor