intel/intel-technology-enabling-for-openshift

Avoiding node reboots for machine configurations

hershpa opened this issue

Summary:

Node rebooting presents several challenges. Certain machine configurations require rebooting node(s) on an OpenShift cluster. Typically, machine configuration (MachineConfig) updates or changes on an OpenShift cluster are applied by the Machine Config Operator (MCO). From a cluster administrator or end user perspective, reboots are often undesirable in a production environment, since each reboot drains the node and disrupts running workloads.

Process:

A reboot of a node typically involves cordoning the node, which prevents the scheduler from placing new pods onto it. The node is then drained, meaning all running pods are evicted from it. When possible, the scheduler will attempt to reschedule pods evicted from node A onto another node B, which can be challenging when cluster capacity is tight or pods have node-specific constraints. During the reboot, the node reports NotReady; it returns to Ready once the reboot completes successfully. Finally, the node is uncordoned (marked schedulable again), so new pods can be scheduled on it. If multiple nodes are targeted by a specific MachineConfig, they are typically rebooted sequentially. A sketch of this sequence with the oc CLI follows.
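For reference, a minimal sketch of this sequence using the oc CLI; the node name worker-0 is a placeholder, and in practice the MCO performs the equivalent steps automatically:

    # Cordon: mark the node unschedulable so no new pods are placed on it
    # ("worker-0" is a placeholder node name)
    oc adm cordon worker-0

    # Drain: evict the running pods (DaemonSet pods are skipped)
    oc adm drain worker-0 --ignore-daemonsets --delete-emptydir-data

    # ... the node reboots: it reports NotReady, then Ready again ...

    # Uncordon: mark the node schedulable so new pods can land on it
    oc adm uncordon worker-0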

Examples:

  • Since the default firmware directory /lib/firmware is read-only on OCP cluster nodes, a MachineConfig is used to set an alternative firmware path via the kernel argument firmware_class.path=/var/lib/firmware so that out-of-tree (OOT) firmware can be loaded on an RHCOS node. The Kernel Module Management (KMM) Operator copies the firmware from the driver container to the alternative firmware path after the driver container is deployed. This approach is used to load OOT GPU firmware and provision the Intel GPU card on OpenShift.

  • Similarly, for QAT, the intel_iommu kernel parameter is enabled via the MCO. Kernel argument changes applied through the MCO trigger a one-time reboot of each targeted node to reach the desired configuration. A sketch of such a MachineConfig follows this list.
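For illustration, a sketch of a MachineConfig carrying both kernel arguments from the examples above; the object name is made up, and in practice each argument would likely ship in its own MachineConfig:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      # Illustrative name; the 99- prefix keeps it late in rendering order
      name: 99-worker-kernel-args
      labels:
        # Targets nodes in the worker machine config pool
        machineconfiguration.openshift.io/role: worker
    spec:
      kernelArguments:
        # Alternative firmware lookup path for OOT firmware (GPU example)
        - firmware_class.path=/var/lib/firmware
        # Enable the IOMMU (QAT example)
        - intel_iommu=on

Applying an object like this causes the MCO to roll the new kernel arguments onto every worker node, rebooting each node once.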

Goal:

When possible, the goal is to perform the configuration operations at runtime to avoid disruption to the cluster and workloads.

Possible Solutions for Certain Scenarios:

In certain scenarios, it may be possible to facilitate a configuration change at runtime.

  • For the alternative firmware path, it may be possible to have KMM configure the lookup path at runtime before loading any module.
    The lookup path is configured on the node with the following command: echo /var/lib/firmware > /sys/module/firmware_class/parameters/path
    For more details, see the kernel documentation on firmware search paths.

  • Another option is to deploy a privileged DaemonSet that configures the lookup path at runtime and then sleeps forever; a minimal sketch appears after this list.
    Note that if the node is rebooted, the lookup path has to be configured again. With either of the two options above, the lookup path should always be configured before any module is loaded; this ordering should be guaranteed by design.

  • Here is a successful example of a node configuration change performed at runtime: KMM 1.1 removes an in-tree module prior to loading the OOT module, entirely at runtime.
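A minimal sketch of the DaemonSet option, assuming a privileged pod with a hostPath mount of /sys; the names, namespace, and image are illustrative:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: firmware-path-setter      # illustrative name
      namespace: firmware-config      # illustrative namespace
    spec:
      selector:
        matchLabels:
          app: firmware-path-setter
      template:
        metadata:
          labels:
            app: firmware-path-setter
        spec:
          containers:
          - name: setter
            # Any small image with a shell works here
            image: registry.access.redhat.com/ubi9/ubi-minimal
            securityContext:
              privileged: true
            command:
            - /bin/sh
            - -c
            # Set the lookup path, then sleep forever so the pod stays up;
            # after a node reboot the pod restarts and re-applies the path
            - |
              echo /var/lib/firmware > /sys/module/firmware_class/parameters/path
              sleep infinity
            volumeMounts:
            - name: host-sys
              mountPath: /sys
          volumes:
          - name: host-sys
            hostPath:
              path: /sys

Because the DaemonSet pod restarts after a reboot, the lookup path is re-applied automatically; the ordering guarantee (path configured before any module load) still has to come from the design, as noted above.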

Hi @qbarrand and @ybettan, we would love to have your perspective and insight, especially on the idea of KMM configuring the alternative firmware lookup path on the fly. Thanks in advance.

The current idea for KMM 2.0 is to run only one DaemonSet; on each node, one pod would download module images, extract them and load kmods. This should facilitate other operations, such as unloading in-tree modules or specifying dependencies. We could also make that pod configure the search path by writing the lookup path to /sys/module/firmware_class/parameters/path as soon as it starts.
@yevgeny-shnaidman WDYT?

This will require mounting the host's /sys filesystem into the DaemonSet with RW permissions. Currently we get the mount for free by using the "privileged" SCC, but it is mounted RO.
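For illustration, the relevant pod spec fragment under that assumption: an explicit hostPath volume gives the container a read-write view of the host's /sys instead of the runtime's read-only default (names are illustrative):

    # Fragment of the (hypothetical) KMM worker pod spec
    containers:
    - name: worker
      securityContext:
        privileged: true
      volumeMounts:
      - name: host-sys
        mountPath: /sys
        readOnly: false             # explicit RW mount of the host's /sys
    volumes:
    - name: host-sys
      hostPath:
        path: /sys
        type: Directory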

Thanks for the input @qbarrand and @yevgeny-shnaidman. Would mounting the host's /sys FS read-write be a viable option?

PR in KMM upstream to set the alternative FW path on the fly: kubernetes-sigs/kernel-module-management#586. Targeted for KMM 2.0.