A Kubernetes controller to manage node maintenance.
- Motivation
- Concept
- Installation
- Configuration
- Check Plugins
- Notification Plugins
- Trigger Plugins
- Example configuration for flatcar update agents
Sometimes the nodes of a Kubernetes cluster need to be put into maintenance. There are several reasons for this, such as updating the node's operating system or updating the kubelet daemon. Putting a node into maintenance requires cordoning and draining it. Stateful applications might have special constraints regarding their termination, which cannot be handled easily using Kubernetes "PreStopHooks" (e.g. in high-availability scenarios). In enterprise contexts, additional processes might influence when a node maintenance is allowed to occur.
The maintenance controller supports enforcing maintenance processes, automating maintenance approvals and customizing termination logic. It is built with flexibility in mind and should be adaptable to different environments and requirements. This is achieved with an extensible plugin system.
Kubernetes nodes are modelled as finite state machines and can be in one of three states.
- Operational
- Maintenance Required
- In Maintenance
A node's current state is saved within a configurable node label. A node transitions to the next state if a chain of configurable "check plugins" decides that its state should move on. Such plugin chains can be configured for each state individually via maintenance profiles. Cluster administrators can assign a maintenance profile to a node using a label. Before a transition finishes, a chain of "trigger plugins" can be invoked, which can perform any action related to termination or startup logic. While a node is in a certain state, a chain of "notification plugins" regularly informs cluster users and administrators about the node being in that state. Several plugins exist, which make it possible to check labels, to alter labels, to be notified via Slack and more.
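With this model, the current state is directly visible on the node object. A node undergoing maintenance might look like this (a sketch; the state values are assumed to match the state names used in the profile configuration below):
apiVersion: v1
kind: Node
metadata:
  labels:
    cloud.sap/maintenance-profile: someprofile
    cloud.sap/maintenance-state: in-maintenance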
Execute make deploy IMG=sapcc/maintenance-controller.
There is a global configuration, which defines some general options, plugin instances and maintenance profiles.
The global configuration should be named ./config/maintenance.yaml
and should be placed relative to the controller's working directory, preferably via a Kubernetes secret or a config map.
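For example, the file could be packaged as a config map, which the controller deployment then mounts at ./config (a sketch; the config map name and the mount wiring are assumptions that depend on your deployment manifests):
kubectl create configmap maintenance-config --from-file=maintenance.yaml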
The basic structure looks like this:
intervals:
  # defines after which duration a node should be checked again
  requeue: 200ms
  # defines after which duration a reminder notification should be sent
  notify: 500ms
# plugin instances are the combination of a plugin and its configuration
instances:
  # there are no notification plugins configured here, but their configuration works the same way as for check and trigger plugins
  notify: null
  # check plugin instances
  check:
  # the list entries define the chosen plugin type
  - hasLabel:
      # name of the instance, which is used in the plugin chain configurations
      # do not use spaces or other special characters, besides the underscore, which is allowed
      name: transition
      # the configuration for the plugin. This block depends on the plugin type
      config:
        key: transition
        value: "true"
  # trigger plugin instances
  trigger:
  - alterLabel:
      name: alter
      config:
        key: alter
        value: "true"
        remove: false
profiles:
  # define a maintenance profile called someprofile
  someprofile:
    # define the plugin chains for the operational state
    operational:
      # the exit condition for the operational state refers to the "transition" plugin instance defined in the instances section
      check: transition
      # the notification instances to invoke while in the operational state
      notify: somenotificationplugin
      # the trigger instances which are invoked when leaving the operational state
      trigger: alter
    # define the plugin chains for the maintenance-required state
    maintenance-required:
      # define chains as shown with the operational state
      check: null
      notify: null
      trigger: null
    # define plugin chains for the in-maintenance state
    in-maintenance:
      # check chains support boolean expressions which evaluate multiple instances
      check: transition && !(a || b)
      # multiple notification instances can also be used
      notify: g && h
      # multiple trigger instances can also be used
      trigger: t && u
Chains may be undefined or empty.
Trigger and notification chains are configured by specifying the desired instance names separated by &&,
e.g. alter && othertriggerplugin.
Check chains can be built using boolean expressions, e.g. transition && !(a || b).
To attach a maintenance profile to a node, the label cloud.sap/maintenance-profile=NAME
has to be set to the desired profile name.
If that label is not present on a node, the controller falls back to the default
profile, which does nothing at all.
The default profile can be reconfigured by defining it within the config file.
The controller's state is tracked with the cloud.sap/maintenance-state
label and the cloud.sap/maintenance-data
annotation.
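For example, to attach the someprofile profile from the configuration above (the node name is illustrative):
kubectl label node worker-0 cloud.sap/maintenance-profile=someprofile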
hasAnnotation: Checks if a node has an annotation with the given key. Optionally asserts the annotation value.
config:
  key: the annotation key, required
  value: the expected annotation value, if empty only the key is checked, optional
hasLabel: Checks if a node has a label with the given key. Optionally asserts the label's value.
config:
  key: the label key, required
  value: the expected label value, if empty only the key is checked, optional
maxMaintenance: Checks that fewer than the specified number of nodes are in the in-maintenance state. Due to the optimistic concurrency control of the API server, this check might return the wrong result if more than one node is reconciled at any given time.
config:
  max: the limit of nodes that are in-maintenance
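A minimal instance definition might look like this (the instance name is illustrative):
instances:
  check:
  - maxMaintenance:
      name: count_limit
      config:
        max: 1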
timeWindow: Checks if the current system time is within the specified weekly UTC time window.
config:
  start: the time window's start time in "hh:mm" format, required
  end: the time window's end time in "hh:mm" format, required
  weekdays: weekdays when the time window is valid as array, e.g. [monday, tuesday, wednesday, thursday, friday, saturday, sunday], required
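For example, an instance that only passes during early workday mornings might look like this (times and instance name are illustrative):
instances:
  check:
  - timeWindow:
      name: workday_morning
      config:
        start: "01:00"
        end: "05:00"
        weekdays: [monday, tuesday, wednesday, thursday, friday]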
wait: Checks if a certain duration has passed since the last state transition.
config:
  duration: a duration according to the rules of Go's time.ParseDuration(), required
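For example, to require at least two hours between state transitions (duration and instance name are illustrative):
instances:
  check:
  - wait:
      name: cooldown
      config:
        duration: 2h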
mail: Sends an e-mail.
config:
  auth: boolean value, which defines if the plugin should use plain auth or no auth at all, required
  address: address of the SMTP server with port, required
  from: e-mail address of the sender, required
  identity: the identity used for authentication against the SMTP server, optional
  subject: the subject of the mail
  message: the content of the mail, this supports Go templating, e.g. {{ .State }} to get the current state as string or {{ .Node }} to access the node object, required
  password: the password used for authentication against the SMTP server, optional
  to: array of recipients, required
  user: the user used for authentication against the SMTP server, optional
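A sketch of an instance definition; the server, the addresses and the instance name are placeholders:
instances:
  notify:
  - mail:
      name: mail_operators
      config:
        auth: false
        address: "mail.example.com:587"
        from: "maintenance@example.com"
        subject: "Node maintenance"
        message: "The node {{ .Node.Name }} is in state {{ .State }}."
        to: ["operators@example.com"]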
slack: Sends a Slack message.
config:
  hook: an incoming Slack webhook, required
  channel: the channel which the message should be sent to, required
  message: the content of the Slack message, this supports Go templating, e.g. {{ .State }} to get the current state as string or {{ .Node }} to access the node object, required
alterAnnotation: Adds, changes or removes an annotation.
config:
  key: the annotation's key, required
  value: the value to set, optional
  remove: boolean value, if true the annotation is removed, if false the annotation is added or changed, optional
alterLabel: Adds, changes or removes a label.
config:
  key: the label's key, required
  value: the value to set, optional
  remove: boolean value, if true the label is removed, if false the label is added or changed, optional
The following example configuration integrates with the Flatcar Linux update agent, which requests a reboot via the reboot-needed annotation and reboots once the reboot-ok annotation is granted:
intervals:
  requeue: 60s
  notify: 5h
instances:
  notify:
  - slack:
      name: approval_required
      config:
        hook: Your hook
        channel: Your channel
        message: |
          The node {{ .Node.Name }} requires maintenance. Manual approval is required.
          Approve to drain and reboot this node by running:
          `kubectl annotate node {{ .Node.Name }} cloud.sap/maintenance-approved=true`
  - slack:
      name: maintenance_started
      config:
        hook: Your hook
        channel: Your channel
        message: |
          Maintenance for node {{ .Node.Name }} has started.
  check:
  - hasAnnotation:
      name: reboot_needed
      config:
        key: flatcar-linux-update.v1.flatcar-linux.net/reboot-needed
        value: "true"
  - hasAnnotation:
      name: check_approval
      config:
        key: cloud.sap/maintenance-approved
        value: "true"
  trigger:
  - alterAnnotation:
      name: reboot-ok
      config:
        key: flatcar-linux-update.v1.flatcar-linux.net/reboot-ok
        value: "true"
  - alterAnnotation:
      name: remove_approval
      config:
        key: cloud.sap/maintenance-approved
        remove: true
  - alterAnnotation:
      name: remove_reboot_ok
      config:
        key: flatcar-linux-update.v1.flatcar-linux.net/reboot-ok
        remove: true
profiles:
  flatcar:
    operational:
      check: reboot_needed
    maintenance-required:
      check: check_approval
      notify: approval_required
      trigger: remove_approval && reboot-ok
    in-maintenance:
      check: "!reboot_needed"
      notify: maintenance_started
      trigger: remove_reboot_ok