A Kubernetes controller to manage node maintenance.
- Motivation
- Concept
- Installation
- Configuration
- Check Plugins
- Notification Plugins
- Trigger Plugins
- Example configuration for flatcar update agents
Sometimes the nodes of a Kubernetes cluster need to be put into maintenance. There are several reasons for this, such as updating the node's operating system or updating the kubelet daemon. Putting a node into maintenance requires cordoning and draining it. Stateful applications might have special constraints regarding their termination, which cannot be handled easily using Kubernetes "PreStopHooks" (e.g. in high-availability scenarios). In enterprise contexts, additional processes might influence when a node maintenance is allowed to occur.
The maintenance controller supports enforcing maintenance processes, automating maintenance approvals and customizing termination logic. It is built with flexibility in mind and should be adaptable to different environments and requirements. This is achieved with an extensible plugin system.
Kubernetes nodes are modelled as finite state machines and can be in one of three states.
- Operational
- Maintenance Required
- In Maintenance
A node's current state is saved within a configurable node label. A node transitions to the next state if a chain of configurable "check plugins" decides that its state should move on. Such plugin chains can be configured for each state individually via maintenance profiles. Cluster administrators can assign a maintenance profile to a node using a label. Before a transition finishes, a chain of "trigger plugins" can be invoked, which can perform any action related to termination or startup logic. While a node is in a certain state, a chain of "notification plugins" regularly informs cluster users and administrators about the node being in that state. Several plugins exist, which make it possible to check labels, to alter labels, to be notified via Slack and more.
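With this model, the current state is directly visible on the node object. A node undergoing maintenance might look like this (a sketch; the state values are assumed to match the state names used in the profile configuration below):
apiVersion: v1
kind: Node
metadata:
  labels:
    cloud.sap/maintenance-profile: someprofile
    cloud.sap/maintenance-state: in-maintenance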
Execute make deploy IMG=sapcc/maintenance-controller.
There is a global configuration, which defines some general options, plugin instances and maintenance profiles.
The global configuration should be named ./config/maintenance.yaml
and should be placed relative to the controller's working directory, preferably via a Kubernetes secret or a config map.
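For example, the file could be packaged as a config map, which the controller deployment then mounts at ./config (a sketch; the config map name and the mount wiring are assumptions that depend on your deployment manifests):
kubectl create configmap maintenance-config --from-file=maintenance.yaml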
The basic structure looks like this:
intervals:
  # defines after which duration a node should be checked again
  requeue: 200ms
  # defines after which duration a reminder notification should be sent
  notify: 500ms
# plugin instances are the combination of a plugin and its configuration
instances:
  # there are no notification plugins configured here, but their configuration works the same way as for check and trigger plugins
  notify: null
  # check plugin instances
  check:
  # the list entries define the chosen plugin type
  - hasLabel:
      # name of the instance, which is used in the plugin chain configurations
      # do not use spaces or other special characters, besides the underscore, which is allowed
      name: transition
      # the configuration for the plugin. This block depends on the plugin type
      config:
        key: transition
        value: "true"
  # trigger plugin instances
  trigger:
  - alterLabel:
      name: alter
      config:
        key: alter
        value: "true"
        remove: false
profiles:
  # define a maintenance profile called someprofile
  someprofile:
    # define the plugin chains for the operational state
    operational:
      # the exit condition for the operational state refers to the "transition" plugin instance defined in the instances section
      check: transition
      # the notification instances to invoke while in the operational state
      notify: somenotificationplugin
      # the trigger instances which are invoked when leaving the operational state
      trigger: alter
    # define the plugin chains for the maintenance-required state
    maintenance-required:
      # define chains as shown with the operational state
      check: null
      notify: null
      trigger: null
    # define plugin chains for the in-maintenance state
    in-maintenance:
      # check chains support boolean expressions which evaluate multiple instances
      check: transition && !(a || b)
      # multiple notification instances can also be used
      notify: g && h
      # multiple trigger instances can also be used
      trigger: t && u
Chains may be undefined or empty.
Trigger and notification chains are configured by specifying the desired instance names separated by &&,
e.g. alter && othertriggerplugin.
Check chains can be built using boolean expressions, e.g. transition && !(a || b).
To attach a maintenance profile to a node, the label cloud.sap/maintenance-profile=NAME
has to be set to the desired profile name.
If that label is not present on a node, the controller falls back to the default
profile, which does nothing at all.
The default profile can be reconfigured by defining it within the config file.
The controller's state is tracked with the cloud.sap/maintenance-state
label and the cloud.sap/maintenance-data
annotation.
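For example, to attach the someprofile profile from the configuration above (the node name is illustrative):
kubectl label node worker-0 cloud.sap/maintenance-profile=someprofile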
hasAnnotation: Checks if a node has an annotation with the given key. Optionally asserts the annotation value.
config:
  key: the annotation key, required
  value: the expected annotation value, if empty only the key is checked, optional
hasLabel: Checks if a node has a label with the given key. Optionally asserts the label's value.
config:
  key: the label key, required
  value: the expected label value, if empty only the key is checked, optional
maxMaintenance: Checks that fewer than the specified number of nodes are in the in-maintenance state. Due to the optimistic concurrency control of the API server, this check might return the wrong result if more than one node is reconciled at any given time.
config:
  max: the limit of nodes that are in-maintenance
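A minimal instance definition might look like this (the instance name is illustrative):
instances:
  check:
  - maxMaintenance:
      name: count_limit
      config:
        max: 1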
timeWindow: Checks if the current system time is within the specified weekly UTC time window.
config:
  start: the time window's start time in "hh:mm" format, required
  end: the time window's end time in "hh:mm" format, required
  weekdays: weekdays when the time window is valid as array, e.g. [monday, tuesday, wednesday, thursday, friday, saturday, sunday], required
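For example, an instance that only passes during early workday mornings might look like this (times and instance name are illustrative):
instances:
  check:
  - timeWindow:
      name: workday_morning
      config:
        start: "01:00"
        end: "05:00"
        weekdays: [monday, tuesday, wednesday, thursday, friday]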
wait: Checks if a certain duration has passed since the last state transition.
config:
  duration: a duration according to the rules of Go's time.ParseDuration(), required
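For example, to require at least two hours between state transitions (duration and instance name are illustrative):
instances:
  check:
  - wait:
      name: cooldown
      config:
        duration: 2h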
mail: Sends an e-mail.
config:
  auth: boolean value, which defines if the plugin should use plain auth or no auth at all, required
  address: address of the SMTP server with port, required
  from: e-mail address of the sender, required
  identity: the identity used for authentication against the SMTP server, optional
  subject: the subject of the mail
  message: the content of the mail, this supports Go templating, e.g. {{ .State }} to get the current state as string or {{ .Node }} to access the node object, required
  password: the password used for authentication against the SMTP server, optional
  to: array of recipients, required
  user: the user used for authentication against the SMTP server, optional
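A sketch of an instance definition; the server, the addresses and the instance name are placeholders:
instances:
  notify:
  - mail:
      name: mail_operators
      config:
        auth: false
        address: "mail.example.com:587"
        from: "maintenance@example.com"
        subject: "Node maintenance"
        message: "The node {{ .Node.Name }} is in state {{ .State }}."
        to: ["operators@example.com"]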
slack: Sends a Slack message.
config:
  hook: an incoming Slack webhook, required
  channel: the channel which the message should be sent to, required
  message: the content of the Slack message, this supports Go templating, e.g. {{ .State }} to get the current state as string or {{ .Node }} to access the node object, required
alterAnnotation: Adds, changes or removes an annotation.
config:
  key: the annotation's key, required
  value: the value to set, optional
  remove: boolean value, if true the annotation is removed, if false the annotation is added or changed, optional
alterLabel: Adds, changes or removes a label.
config:
  key: the label's key, required
  value: the value to set, optional
  remove: boolean value, if true the label is removed, if false the label is added or changed, optional
The following example configuration integrates with the Flatcar Linux update agent, which requests a reboot via the reboot-needed annotation and reboots once the reboot-ok annotation is granted:
intervals:
  requeue: 60s
  notify: 5h
instances:
  notify:
  - slack:
      name: approval_required
      config:
        hook: Your hook
        channel: Your channel
        message: |
          The node {{ .Node.Name }} requires maintenance. Manual approval is required.
          Approve to drain and reboot this node by running:
          `kubectl annotate node {{ .Node.Name }} cloud.sap/maintenance-approved=true`
  - slack:
      name: maintenance_started
      config:
        hook: Your hook
        channel: Your channel
        message: |
          Maintenance for node {{ .Node.Name }} has started.
  check:
  - hasAnnotation:
      name: reboot_needed
      config:
        key: flatcar-linux-update.v1.flatcar-linux.net/reboot-needed
        value: "true"
  - hasAnnotation:
      name: check_approval
      config:
        key: cloud.sap/maintenance-approved
        value: "true"
  trigger:
  - alterAnnotation:
      name: reboot-ok
      config:
        key: flatcar-linux-update.v1.flatcar-linux.net/reboot-ok
        value: "true"
  - alterAnnotation:
      name: remove_approval
      config:
        key: cloud.sap/maintenance-approved
        remove: true
  - alterAnnotation:
      name: remove_reboot_ok
      config:
        key: flatcar-linux-update.v1.flatcar-linux.net/reboot-ok
        remove: true
profiles:
  flatcar:
    operational:
      check: reboot_needed
    maintenance-required:
      check: check_approval
      notify: approval_required
      trigger: remove_approval && reboot-ok
    in-maintenance:
      check: "!reboot_needed"
      notify: maintenance_started
      trigger: remove_reboot_ok