Halkyon operator framework

This project focuses on extracting code that, while somewhat targeted at the Halkyon operator, could be used in other operator projects generated by the operator-sdk project. More specifically, parts of the code are specific to Halkyon because the priority hasn’t been on making it fully independent, but most of the code is generic, and it shouldn’t be too complex to extract the Halkyon-specific parts completely.

Architecture

We designed this framework to ease the development of operators. The base premise is to abstract away the low-level concerns of how to reconcile the state of custom resources. Users can thus focus on the actual logic of what the operator is trying to achieve.

Since we developed this framework while working on Halkyon, we will use Halkyon to illustrate the concepts.

Resource and DependentResource

This framework introduces two core concepts, represented by the Resource and DependentResource interfaces, which operator creators must implement to use this framework.

A Resource (also called "primary" resource) represents a custom resource that the operator manages. This is one specificity of this framework: it assumes that you deal with one or several custom resources.

Resource interface
// Resource is the core interface allowing users to define the behavior of primary resources. A Resource is primarily
// responsible for managing the set of its associated DependentResources and taking the appropriate actions based on their status
type Resource interface {
	v1.Object
	runtime.Object
	v1beta1.StatusAware
	// NeedsRequeue determines whether this Resource needs to be requeued in the reconcile loop
	NeedsRequeue() bool
	// ComputeStatus computes the status of this Resource based on the cluster state. Default implementation uses the
	// aggregated status of this Resource's dependents' condition. Return value indicates whether the status of the Resource has
	// changed as the result of the computation and therefore needs to be updated on the cluster.
	ComputeStatus() (needsUpdate bool)
	// CheckValidity checks whether this Resource is valid according to its semantics. Note that some/all of this functionality
	// might be implemented as a validation webhook instead.
	CheckValidity() error
	// ProvideDefaultValues initializes any potentially missing optional values to appropriate defaults
	ProvideDefaultValues() bool
	// GetUnderlyingAPIResource returns the object implementing the custom resource this Resource represents as a
	// SerializableResource
	GetUnderlyingAPIResource() SerializableResource
	// Delete performs any operation that might be needed when a reconcile request occurs for a Resource that does not exist on
	// the cluster anymore
	Delete() error
	// CreateOrUpdate creates or updates all dependent resources associated with this Resource depending on the state of the
	// cluster
	CreateOrUpdate() error
	// NewEmpty returns a new empty instance of this Resource so that it can be populated during the reconcile loop. Note that
	// NewEmpty must return a Resource with an initialized GroupVersionKind so that calls to the GroupVersionKind method are
	// guaranteed to return a non-empty GroupVersionKind
	NewEmpty() Resource
	// InitDependentResources returns the array of DependentResources that are associated with this Resource.
	InitDependentResources() ([]DependentResource, error)
}

A DependentResource, on the other hand, represents any resource that is needed to realize the desired state as described by a Resource under the operator’s control. A Resource is therefore associated with a set of DependentResources, which it needs in order to be realized on the cluster.

DependentResource interface
// DependentResource represents any resource a Resource requires to be realized on the cluster.
type DependentResource interface {
	// Name returns the name used to identify this DependentResource on the cluster, given the parent Resource's namespace
	Name() string
	// Owner returns the SerializableResource owning this DependentResource. For all intents and purposes, this owner is a
	// Resource, reduced to its strictly needed information so that it can be serialized and sent over the network to plugins.
	Owner() SerializableResource
	// Fetch retrieves the object associated with this DependentResource from the cluster
	Fetch() (runtime.Object, error)
	// Build generates the runtime.Object needed to store the representation of this DependentResource on the cluster. For
	// example, a DependentResource representing a Kubernetes secret would return a Secret object as defined by the Kubernetes
	// API.
	Build(empty bool) (runtime.Object, error)
	// Update applies any needed changes to the specified runtime.Object and returns an updated version which calling code needs
	// to use since the return object might be different from the input one. The first return value is a bool indicating whether
	// or not the input object was changed in the process so that the framework can know whether to store the updated value.
	Update(toUpdate runtime.Object) (bool, runtime.Object, error)
	// GetCondition returns a DependentCondition object describing the condition of this DependentResource based either on the
	// state of the specified underlying runtime.Object (i.e. the Kubernetes resource) associated with this DependentResource or
	// the given error which might have occurred while processing this DependentResource.
	GetCondition(underlying runtime.Object, err error) *v1beta1.DependentCondition
	// GetConfig retrieves the configuration associated with this DependentResource, configuration describing how the framework
	// needs to handle this DependentResource when it comes to watching it for changes, updating it, etc.
	GetConfig() DependentResourceConfig
}

For example, Halkyon defines two custom resources, Component and Capability, and each of these is implemented as a Resource. While the Component resource is rather simple, it is actually realized by the combination of several Kubernetes resources on the cluster: ServiceAccount, Deployment, Service, PVC, Ingress, etc. Each of these resources is defined as a "dependent" (or "secondary") resource of the Component Resource. Note that a dependent resource can also be another custom resource: this is actually the case for the Component resource, which can have Capability dependents.

An important part of what an operator does is computing the status of a given resource and deciding what, if anything, needs to be done to reconcile the cluster state with the state desired by the user, as expressed by the custom resources it handles. This framework relies on the concept of DependentCondition, which allows each dependent resource to report its status, which in turn allows the associated primary resource to compute an aggregated status. While the conditions rely on basic statuses such as Ready, Failed or Pending, it is possible for a dependent resource to define a more specific status.

DependentCondition struct
// DependentCondition contains details for the current condition of the associated DependentResource.
type DependentCondition struct {
	// Type of the condition.
	Type DependentConditionType `json:"type"`
	// Type of the dependent associated with the condition.
	DependentType schema.GroupVersionKind `json:"dependentType"`
	// Name of the dependent associated with the condition.
	DependentName string `json:"dependentName"`
	// Records the last time the condition transitioned from one status to another.
	// +optional
	LastTransitionTime v1.Time `json:"lastTransitionTime,omitempty"`
	// Unique, one-word, CamelCase reason for the condition's last transition.
	// +optional
	Reason string `json:"reason,omitempty"`
	// Human-readable message indicating details about last transition.
	// +optional
	Message string `json:"message,omitempty"`
	// Additional information that the condition wishes to convey/record as name-value pairs.
	// +optional
	Attributes []NameValuePair `json:"attributes,omitempty"`
}
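
As an illustration of how a primary resource might aggregate the conditions reported by its dependents, here is a minimal, self-contained sketch. The ConditionType and Condition types below are simplified stand-ins for the framework's types, and the precedence rule (Failed over Pending over Ready) is an assumption made for the example, not necessarily what the default ComputeStatus implementation does.

```go
package main

import "fmt"

// ConditionType is a hypothetical stand-in for v1beta1.DependentConditionType.
type ConditionType string

const (
	Ready   ConditionType = "Ready"
	Pending ConditionType = "Pending"
	Failed  ConditionType = "Failed"
)

// Condition is a reduced, hypothetical version of DependentCondition.
type Condition struct {
	Type          ConditionType
	DependentName string
	Message       string
}

// Aggregate folds dependent conditions into one status for the primary
// resource: the first Failed dependent wins, then the first Pending one,
// otherwise the resource is considered Ready.
func Aggregate(conds []Condition) (ConditionType, string) {
	status, msg := Ready, ""
	for _, c := range conds {
		switch c.Type {
		case Failed:
			return Failed, c.DependentName + ": " + c.Message
		case Pending:
			if status == Ready {
				status, msg = Pending, c.DependentName+": "+c.Message
			}
		}
	}
	return status, msg
}

func main() {
	status, msg := Aggregate([]Condition{
		{Type: Ready, DependentName: "service"},
		{Type: Pending, DependentName: "deployment", Message: "waiting for pods"},
	})
	fmt.Println(status, msg)
}
```

Returning the name of the dependent that caused the status makes it easier to surface a useful message on the primary resource.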

Some of these dependent resources have a lifetime that is tied to that of their associated primary resource while others don’t. Similarly, the operator might need to be informed of changes to the dependent resources to update the state / status of the primary resource. Dependent resources might need to be created or updated during the lifetime of the primary resource. The framework will take the appropriate action based on the DependentResourceConfig associated with a DependentResource.

DependentResourceConfig struct
// DependentResourceConfig represents the configuration associated with a DependentResource. The framework takes action based on
// this configuration, for example, on whether the associated DependentResource is checked for readiness when assessing the
// status of its associated Resource or whether it needs to be watched, created or updated… The defaultConfig var records the
// default values for those that might be omitted.
type DependentResourceConfig struct {
	// Watched determines whether the operator should be notified when the associated DependentResource's state changes.
	// Defaults to true.
	Watched bool
	// Owned determines whether the Resource associated with the associated DependentResource owns this DependentResource,
	// meaning that the lifecycle of the DependentResource is tied to that of its Resource (e.g. the DependentResource is
	// deleted when the parent Resource is deleted). Defaults to true.
	Owned bool
	// Created determines whether the associated DependentResource should be created if it doesn't already exist. Generally,
	// this should be true, however, in some cases such as when a DependentResource is actually another Resource, i.e.
	// something that can (and maybe needs to) be created by a user, this should be set to false indicating that the operator
	// should wait for the associated DependentResource to be created, independently. Defaults to true.
	Created bool
	// Updated determines whether the associated DependentResource defines custom behavior to be applied when the resource
	// already exists on the cluster. Defaults to false.
	Updated bool
	// CheckedForReadiness determines whether the associated DependentResource should participate in the overall status of the
	// parent Resource, in particular when it comes to checking whether the Resource is considered ready to be used. Defaults
	// to false.
	CheckedForReadiness bool
	// GroupVersionKind records the GroupVersionKind of the associated DependentResource so that it can be used with
	// Unstructured for example.
	GroupVersionKind schema.GroupVersionKind
	// TypeName records the DependentResource's type to be displayed in messages / logs, this defaults to its associated Kind
	// but, in some instances, e.g. for Capabilities part of Component's contract, it might be needed to be overridden to be
	// more precise / specific.
	TypeName string
}
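
Several of these fields default to true, which a plain Go struct literal cannot express since the zero value of a bool is false. The sketch below shows one way a constructor can apply the documented defaults before they are overridden. The Config type is a reduced, hypothetical copy of DependentResourceConfig, and this NewConfig only approximates the framework's own function of the same name.

```go
package main

import "fmt"

// Config is a reduced, hypothetical copy of DependentResourceConfig.
type Config struct {
	Watched, Owned, Created, Updated, CheckedForReadiness bool
	Kind                                                  string // stands in for GroupVersionKind
}

// defaultConfig records the documented defaults. A bare Config{} literal would
// leave Watched, Owned and Created set to false, which is not the default.
var defaultConfig = Config{Watched: true, Owned: true, Created: true}

// NewConfig returns a copy of the defaults for the given kind.
func NewConfig(kind string) Config {
	c := defaultConfig
	c.Kind = kind
	return c
}

func main() {
	// A Deployment created and owned by the operator, checked for readiness.
	deployment := NewConfig("Deployment")
	deployment.CheckedForReadiness = true

	// A dependent that is itself a custom resource a user may create:
	// the operator waits for it instead of creating it.
	capability := NewConfig("Capability")
	capability.Created = false

	fmt.Printf("%+v\n%+v\n", deployment, capability)
}
```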

Rooted in these concepts, the framework provides default, generic behaviors enabling users to quickly get running while still providing customization points so that some parts of the behavior can be adapted as needed.

Base implementations

Recognizing that there is a lot of commonality in how the core interfaces might be implemented, the framework also offers base implementations that can be embedded in your own to make it even easier to provide support for a primary resource and its dependents.

BaseResource can be used as a starting point for a Resource interface implementation.

Here’s how Halkyon uses BaseResource to bootstrap the implementation of the Resource interface for the code that is responsible for handling Halkyon Component (defined by the Halkyon API):

Component reuse of BaseResource
package component

import (
	halkyon "halkyon.io/api/component/v1beta1"
	"halkyon.io/operator-framework"
)

// blank assignment to check that Component implements Resource
var _ framework.Resource = &Component{}

// Component implements the Resource interface to handle behavior tied to the state of Halkyon's Component CR.
type Component struct {
	*halkyon.Component
	*framework.BaseResource
}

// NewComponent creates a new Component instance, reusing BaseResource as the foundation for its behavior
func NewComponent() *Component {
	c := &Component{Component: &halkyon.Component{}}
	// initialize the BaseResource, delegating its status handling to our newly created instance as StatusAware instance
	c.BaseResource = framework.NewBaseResource(c)
	c.Component.SetGroupVersionKind(c.Component.GetGroupVersionKind()) // make sure that GVK is set on the runtime object
	return c
}

Once this is set up, Component can reuse behavior from BaseResource. For example, Component’s implementation of Resource’s CreateOrUpdate method first calls BaseResource’s CreateOrUpdateDependents and then adds further logic.

Similarly, we provide a BaseDependentResource implementation which provides some default behavior to serve as the basis for DependentResource implementations.

Here is how BaseDependentResource can be used:

Using BaseDependentResource
package foo

import (
	framework "halkyon.io/operator-framework"
	v1 "k8s.io/api/core/v1"
)

// Records the GVK for the underlying type we're interested in working with (here, a Pod)
var podGVK = v1.SchemeGroupVersion.WithKind("Pod")

// example is a simple, example implementation of DependentResource
type example struct {
	*framework.BaseDependentResource
}

// blank assignment to make sure that our struct properly implements the DependentResource interface
var _ framework.DependentResource = &example{}

// NewOwnerResource creates a new example instance given the specified owner Resource as a SerializableResource
func NewOwnerResource(owner framework.SerializableResource) *example {
	// Create a new, default config with the specified GVK
	config := framework.NewConfig(podGVK)
	// Override some of the default configuration if needed, here we want to check this dependent for its
	// readiness when computing the owner's status
	config.CheckedForReadiness = true
	// Create an instance of the struct, properly initializing the embedded BaseDependentResource
	p := &example{framework.NewConfiguredBaseDependentResource(owner, config)}
	return p
}

We can then implement the missing DependentResource methods, using the default implementations provided by the framework.

Here is how this example DependentResource could implement the GetCondition method using the default implementation to set things up before checking if the underlying Pod is ready:

GetCondition implementation using default implementation
func (res example) GetCondition(underlying runtime.Object, err error) *v1beta1.DependentCondition {
	return framework.DefaultCustomizedGetConditionFor(res, err, underlying, func(underlying runtime.Object, cond *v1beta1.DependentCondition) {
		pod := underlying.(*v1.Pod)
		for _, c := range pod.Status.Conditions {
			if c.Type == v1.PodReady {
				cond.Type = v1beta1.DependentReady
				if c.Status != v1.ConditionTrue {
					cond.Type = v1beta1.DependentPending
				}
				cond.Message = c.Message
				cond.Reason = c.Reason
			}
		}
		return
	})
}

The CreateOrUpdate(r DependentResource) error function provides the generic steps needed to create or update (shocking, right?) DependentResource instances and shows how to use many of the DependentResource methods.
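
To give an idea of what such generic steps can look like, here is a simplified sketch that runs against an in-memory map instead of a real cluster. The Dependent interface, the Object and Cluster types and the returned outcome strings are all hypothetical. The real function works with runtime.Object and the DependentResourceConfig, and may order its steps differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Object stands in for runtime.Object in this sketch.
type Object map[string]string

// Dependent is a hypothetical, reduced DependentResource.
type Dependent interface {
	Name() string
	Build() Object
	Update(existing Object) (bool, Object)
	ShouldCreate() bool // mirrors the Created configuration flag
	ShouldUpdate() bool // mirrors the Updated configuration flag
}

// Cluster is an in-memory stand-in for the Kubernetes API.
type Cluster map[string]Object

// ErrNotCreated signals a dependent that must be created by someone else.
var ErrNotCreated = errors.New("dependent must be created independently")

// CreateOrUpdate fetches the dependent, builds it when missing and allowed,
// and otherwise applies its custom update logic when configured to do so.
func CreateOrUpdate(c Cluster, d Dependent) (string, error) {
	existing, found := c[d.Name()]
	if !found {
		if !d.ShouldCreate() {
			return "waiting", ErrNotCreated
		}
		c[d.Name()] = d.Build()
		return "created", nil
	}
	if d.ShouldUpdate() {
		if changed, updated := d.Update(existing); changed {
			c[d.Name()] = updated
			return "updated", nil
		}
	}
	return "unchanged", nil
}

// settings is an example dependent holding a single value.
type settings struct{ name, value string }

func (s settings) Name() string  { return s.name }
func (s settings) Build() Object { return Object{"value": s.value} }
func (s settings) Update(existing Object) (bool, Object) {
	if existing["value"] == s.value {
		return false, existing
	}
	return true, Object{"value": s.value}
}
func (s settings) ShouldCreate() bool { return true }
func (s settings) ShouldUpdate() bool { return true }

func main() {
	cluster := Cluster{}
	for _, v := range []string{"a", "a", "b"} {
		outcome, _ := CreateOrUpdate(cluster, settings{"settings", v})
		fmt.Println(outcome) // created, then unchanged, then updated
	}
}
```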

Generic reconciler

Once we have defined what a Resource should do and how it is composed of DependentResources, we need to tell our operator how to reconcile instances. To this end, we provide a generic reconciler implementation that knows how to reconcile Resource instances based on the implementation of the interface that they provide.

Here’s the GenericReconciler and how to create a new instance from a given Resource implementation.

GenericReconciler
// GenericReconciler implements Reconciler in a generic way as it pertains to reconciling a Resource
type GenericReconciler struct {
	resource Resource
}

// blank assignment to make sure we implement Reconciler
var _ reconcile.Reconciler = &GenericReconciler{}

// NewGenericReconciler creates a new GenericReconciler that can handle resources represented by the specified Resource, which
// acts as a prototype standing in for instances that will be reconciled.
func NewGenericReconciler(resource Resource) *GenericReconciler {
	return &GenericReconciler{resource: resource}
}

We pass an instance of Resource that acts as a prototype for instances that will be reconciled. If you look at the Reconcile method, you’ll see how Resource methods are used. In particular, we begin by "instantiating" a new, empty instance from the prototype, which we then initialize by fetching the associated state on the cluster:

Resource initialization
// Get a new empty instance from the prototype
resource := b.resource.NewEmpty()
// Initialize it from the cluster state, using the name / namespace from the reconcile request
resource.SetName(request.Name)
resource.SetNamespace(request.Namespace)
_, err := Helper.Fetch(request.Name, request.Namespace, resource.GetUnderlyingAPIResource())
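
The prototype approach used above can be illustrated in isolation. In the following sketch, the Resource interface is reduced to the few methods needed for the example, and the Component and Reconciler types are hypothetical. The point is only that each reconcile request works on a fresh instance, so the prototype itself never carries state from one request to the next.

```go
package main

import "fmt"

// Resource is a hypothetical, reduced version of the framework's Resource.
type Resource interface {
	NewEmpty() Resource
	SetName(name string)
	GetName() string
}

// Component is an example Resource implementation.
type Component struct{ name string }

func (c *Component) NewEmpty() Resource  { return &Component{} }
func (c *Component) SetName(name string) { c.name = name }
func (c *Component) GetName() string     { return c.name }

// Reconciler keeps a prototype and derives one instance per request from it.
type Reconciler struct{ Prototype Resource }

// InstanceFor creates a fresh Resource for the requested name.
func (r Reconciler) InstanceFor(name string) Resource {
	res := r.Prototype.NewEmpty()
	res.SetName(name)
	return res
}

func main() {
	r := Reconciler{Prototype: &Component{}}
	a, b := r.InstanceFor("frontend"), r.InstanceFor("backend")
	fmt.Println(a.GetName(), b.GetName(), r.Prototype.GetName() == "")
}
```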

Helper

You might have noticed above that we delegated the fetching part to something called Helper. This is a K8SHelper instance set up by the operator when it initializes.

K8SHelper
// K8SHelper provides access to, and ways to interact with, the Kubernetes environment we're running on
type K8SHelper struct {
	Client client.Client
	Config *rest.Config
	Scheme *runtime.Scheme
}

// Helper provides easy access to the K8SHelper that has been set up when the operator called InitHelper
var Helper K8SHelper

Using the framework to implement a new operator

Once you’ve generated your operator’s skeleton using operator-sdk and created your Resource and DependentResource implementations, you can replace some of the skeleton code with calls to this framework. The basic steps are as follows:

Bootstrapping the framework in your operator’s main function
// Retrieve the configuration and create a new Manager
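cfg := config.GetConfigOrDie() // named cfg to avoid shadowing the config package
mgr, err := manager.New(cfg, options)
if err != nil {
    log.Error(err, "unable to create the manager")
    os.Exit(1) // the manager is unusable at this point
}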

// Initialize the helper as soon as the manager is created
framework.InitHelper(mgr)

// Add our CRs to the manager's scheme
log.Info("Registering Halkyon resources")
if err := halkyon.AddToScheme(mgr.GetScheme()); err != nil {
    log.Error(err, "unable to register Halkyon resources")
    os.Exit(1)
}

// Register 3rd party resources we might need (note that plugins can dynamically register resources on demand)
log.Info("Registering 3rd party resources")
registerAdditionalResources(mgr)

...

// Create a new generic reconciler for our Resource implementation (here, Component)
if err := framework.RegisterNewReconciler(component.NewComponent(), mgr); err != nil {
    log.Error(err, "")
    os.Exit(1)
}

The two important steps which differ from the default generated code are:

  1. you need to call InitHelper as soon as the Manager instance is created

  2. you create and register your controller differently: you do not have to register watchers explicitly, since this is all done by RegisterNewReconciler, which takes the appropriate steps based on the behavior provided by your Resource implementation
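
To make the second point more concrete, the sketch below derives a list of watches from the configuration of a resource's dependents. Everything in it is hypothetical: the Config and Watch types, the WatchesFor function and the idea of mapping events on owned dependents back to their owner are assumptions made for illustration, not a description of what RegisterNewReconciler actually does.

```go
package main

import "fmt"

// Config is a reduced, hypothetical DependentResourceConfig.
type Config struct {
	Kind           string
	Watched, Owned bool
}

// Watch describes one kind the controller should observe. MapToOwner
// indicates that events should be translated into a request for the owner.
type Watch struct {
	Kind       string
	MapToOwner bool
}

// WatchesFor returns the primary kind plus every distinct watched dependent kind.
func WatchesFor(primaryKind string, dependents []Config) []Watch {
	watches := []Watch{{Kind: primaryKind}}
	seen := map[string]bool{primaryKind: true}
	for _, d := range dependents {
		if !d.Watched || seen[d.Kind] {
			continue
		}
		seen[d.Kind] = true
		watches = append(watches, Watch{Kind: d.Kind, MapToOwner: d.Owned})
	}
	return watches
}

func main() {
	watches := WatchesFor("Component", []Config{
		{Kind: "Deployment", Watched: true, Owned: true},
		{Kind: "Secret", Watched: false, Owned: true},
		{Kind: "Capability", Watched: true, Owned: false},
	})
	fmt.Printf("%+v\n", watches)
}
```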

Plugin architecture overview

Part of what makes Halkyon interesting is the capability system. While the capability concept is powerful, it only makes sense if capabilities can be added to Halkyon without modifying its core. The goal of this plugin architecture is to make it as easy as possible to extend Halkyon by adding new capabilities as plugins. This also has the added advantage of decoupling the releases of the operator from those of its plugins, which can evolve separately (as long as API compatibility is maintained, of course).

The plugin architecture relies at its core on Hashicorp’s go-plugin. This, in turn, means that Halkyon plugins run as processes separate from the operator, relying on RPC communication with the core. A plugin, therefore, consists of two parts:

  • a client that runs in the operator process, controlling the lifecycle of and interacting with the second part of the plugin,

  • a server running in a separate process, implementing the plugin behavior.

However, from a user’s point of view, much, if not all, of that complexity is hidden. We also made a point of hiding that complexity from plugin implementors so that it is as easy as possible to create new plugins, without having to worry about the RPC infrastructure. Each plugin is compiled into a binary and needs to follow some conventions in order to be automatically discoverable and downloadable by the operator.

Note
While the use of RPC makes it technically possible to write plugins using different programming languages, we focused our efforts on (and will only document) the use case of a Go-based plugin.

Client

The operator is only superficially aware of plugins: it loads them from a local plugins directory where each file is assumed to be a capability plugin whose path is passed to the NewPlugin function. See Using plugins in Halkyon for more details.

This function sets up the RPC plumbing: in particular, it starts the plugin process, opens a client to it and registers the plugin so that the operator knows which capabilities it provides. All of this is executed in the operator’s main function when the operator starts. From there, the operator is only aware of the plugin when it attempts to create a capability: based on the requested category and type combination, the operator looks for a plugin supporting that pair to initialize the dependents of the capability object. If a plugin is found, the operator proceeds, transparently interacting with the plugin via the capability object. If no plugin supports the category and type of the desired capability, the capability is set in error until a plugin can be provided to support it (at this time, this requires an operator restart).
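
The lookup by category and type described above can be pictured as a small registry. The following sketch is hypothetical: the Registry type and its methods are not part of Halkyon, and plugins are represented by their name only.

```go
package main

import "fmt"

// key identifies a capability by its category and type.
type key struct{ category, typ string }

// Registry maps supported category and type pairs to a plugin name.
type Registry struct{ plugins map[key]string }

// NewRegistry creates an empty Registry.
func NewRegistry() *Registry {
	return &Registry{plugins: map[key]string{}}
}

// Register records that the named plugin supports the category for each type.
func (r *Registry) Register(plugin, category string, types ...string) {
	for _, t := range types {
		r.plugins[key{category, t}] = plugin
	}
}

// Lookup returns the plugin supporting the pair. A false second value is the
// situation in which the capability would be set in error.
func (r *Registry) Lookup(category, typ string) (string, bool) {
	plugin, ok := r.plugins[key{category, typ}]
	return plugin, ok
}

func main() {
	r := NewRegistry()
	r.Register("kubedb", "database", "postgresql", "mysql")

	if plugin, ok := r.Lookup("database", "postgresql"); ok {
		fmt.Println("handled by", plugin)
	}
	if _, ok := r.Lookup("database", "mongodb"); !ok {
		fmt.Println("no plugin found: capability set in error")
	}
}
```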

Here is the Plugin interface that the operator interacts with, though technically, it only ever calls GetTypes and ReadyFor directly:

// Plugin is the operator-facing interface that can be interacted with in Halkyon
type Plugin interface {
	// Name returns the name of this Plugin
	Name() string
	// GetCategory retrieves the CapabilityCategory supported by this Plugin
	GetCategory() halkyon.CapabilityCategory
	// GetTypes returns TypeInfo providing information about CapabilityTypes this Plugin supports
	GetTypes() []TypeInfo
	// ReadyFor initializes the DependentResources needed by the given Capability and readies the Plugin for requests by the host.
	// Note that the order in which the DependentResources are returned is significant and the operator will process them in the
	// specified order. This is needed because some capabilities might require some dependent resources to be present before
	// processing others.
	ReadyFor(owner *halkyon.Capability) []framework.DependentResource
	// Kill kills the RPC client and server associated with this Plugin when the host process terminates
	Kill()
}

The client takes care of marshalling requests to the plugin in the appropriate format and calls the associated server, with the operator being none the wiser.

Note
Plugin implementors must not implement this interface directly. See Plugin implementation for more details.

Server

Here is the server interface:

type PluginServer interface {
	Build(req PluginRequest, res *BuildResponse) error
	GetCategory(req PluginRequest, res *halkyon.CapabilityCategory) error
	GetDependentResourceTypes(req PluginRequest, res *[]schema.GroupVersionKind) error
	GetTypes(req PluginRequest, res *[]TypeInfo) error
	IsReady(req PluginRequest, res *IsReadyResponse) error
	Name(req PluginRequest, res *string) error
	NameFrom(req PluginRequest, res *string) error
	Update(req PluginRequest, res *UpdateResponse) error
	GetConfig(req PluginRequest, res *framework.DependentResourceConfig) error
}

In typical RPC fashion, at least when it comes to Go’s implementation, the server exposes a set of functions which all follow the <function name>(<input parameter>, <pointer to a response holder>) error format, which is less than natural to interact with. This is why we make sure that plugin implementors don’t need to deal with this; we only show this interface for reference purposes, rejoice! 😄
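
For readers unfamiliar with this calling convention, here is a small sketch using Go's standard net/rpc package, with a client wrapper that hides the convention behind an ordinary method. The Request, Server and Client types are hypothetical, and the connection is an in-memory pipe rather than a separate process. How go-plugin wires things up in Halkyon is not shown here.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Request is a hypothetical stand-in for PluginRequest.
type Request struct{ Owner string }

// Server exposes its behavior using the method shape net/rpc requires.
type Server struct{ name string }

// Name takes an argument and a pointer to a reply holder, and returns an error.
func (s *Server) Name(req Request, res *string) error {
	*res = s.name + " for " + req.Owner
	return nil
}

// Client hides the RPC call behind an ordinary Go method.
type Client struct{ conn *rpc.Client }

// Name calls the remote method and returns its reply directly.
func (c Client) Name(owner string) (string, error) {
	var res string
	err := c.conn.Call("Plugin.Name", Request{Owner: owner}, &res)
	return res, err
}

// Connect registers the server under the "Plugin" name and links a client to
// it over an in-memory pipe.
func Connect(s *Server) (Client, error) {
	srv := rpc.NewServer()
	if err := srv.RegisterName("Plugin", s); err != nil {
		return Client{}, err
	}
	clientConn, serverConn := net.Pipe()
	go srv.ServeConn(serverConn)
	return Client{conn: rpc.NewClient(clientConn)}, nil
}

func main() {
	client, err := Connect(&Server{name: "example-plugin"})
	if err != nil {
		panic(err)
	}
	name, err := client.Name("my-capability")
	fmt.Println(name, err)
}
```

Callers of Client.Name never see the reply pointer, which is the kind of convenience the plugin architecture aims to provide to implementors.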

Plugin implementation

While the RPC part of the infrastructure is abstracted away by the Halkyon plugin architecture, plugin implementors still need to write some code in order to implement the capabilities they want to support. This behavior is encapsulated in one single interface:

// PluginResource gathers behavior that plugin implementors are expected to provide to the plugins architecture
type PluginResource interface {
	// GetSupportedCategory returns the CapabilityCategory that this plugin supports
	GetSupportedCategory() halkyon.CapabilityCategory
	// GetSupportedTypes returns the list of supported CapabilityTypes along with associated versions when they exist.
	// Note that, while a plugin can only support one CapabilityCategory (e.g. "database"), a plugin can provide support for
	// multiple CapabilityTypes (e.g. "postgresql", "mysql", etc.) within the confines of the specified category.
	GetSupportedTypes() []TypeInfo
	// GetDependentResourcesWith returns an ordered list of DependentResources initialized with the specified owner.
	// DependentResources represent secondary resources that the capability might need to work (e.g. Kubernetes Role or Secret)
	// along with the resource (if it exists) implementing the capability itself (e.g. KubeDB's Postgres).
	GetDependentResourcesWith(owner v1beta1.HalkyonResource) []framework.DependentResource
}

As you can see this closely mirrors the Plugin interface that the operator can interact with but is strictly focused on providing the required behavior with as simple an interface as possible.

In order to implement a plugin, you will need to create a Go project importing this project and create a main function similar to the following one:

package main

import (
	plugins "halkyon.io/plugins/capability"
)

func main() {
	var p plugins.PluginResource = ... // create an instance of your PluginResource implementation
	plugins.StartPluginServerFor(p) // register your server and start it
}

You, of course, need to provide your own PluginResource implementation.
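
As a rough idea of what such an implementation involves, the sketch below implements a locally declared copy of the interface. All the types are simplified stand-ins: the real PluginResource uses halkyon.CapabilityCategory, TypeInfo, v1beta1.HalkyonResource and framework.DependentResource, whose exact shapes are not reproduced here, and the resource names are made up for the example.

```go
package main

import "fmt"

// The following types are hypothetical stand-ins for the Halkyon ones.
type CapabilityCategory string
type TypeInfo struct{ Type string }
type DependentResource interface{ Name() string }
type Owner interface{ GetName() string }

// PluginResource mirrors the shape of the interface shown above.
type PluginResource interface {
	GetSupportedCategory() CapabilityCategory
	GetSupportedTypes() []TypeInfo
	GetDependentResourcesWith(owner Owner) []DependentResource
}

// named is a trivial dependent identified by its name only.
type named string

func (n named) Name() string { return string(n) }

// Capability is a trivial owner identified by its name only.
type Capability string

func (c Capability) GetName() string { return string(c) }

// DatabasePlugin supports one category and several types within it.
type DatabasePlugin struct{}

// blank assignment to check that DatabasePlugin implements PluginResource
var _ PluginResource = DatabasePlugin{}

func (DatabasePlugin) GetSupportedCategory() CapabilityCategory { return "database" }

func (DatabasePlugin) GetSupportedTypes() []TypeInfo {
	return []TypeInfo{{Type: "postgresql"}, {Type: "mysql"}}
}

// GetDependentResourcesWith returns the dependents in processing order:
// supporting resources first, then the resource implementing the capability.
func (DatabasePlugin) GetDependentResourcesWith(owner Owner) []DependentResource {
	base := owner.GetName()
	return []DependentResource{
		named(base + "-secret"),
		named(base + "-role"),
		named(base + "-postgres"),
	}
}

func main() {
	p := DatabasePlugin{}
	for _, d := range p.GetDependentResourcesWith(Capability("db")) {
		fmt.Println(d.Name())
	}
}
```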

Example

A full-featured example can be seen at https://github.com/halkyonio/kubedb-capability

Using plugins in Halkyon

Halkyon will attempt to load every file it finds in its local plugins directory as a plugin. These files need to be binaries that can be run on the platform you’re running the operator on. As a convenience, it is possible to pass the operator a comma-separated list of plugins to download automatically from GitHub repositories. This is accomplished using the HALKYON_PLUGINS environment variable (which can, of course, be provided via a ConfigMap). Each plugin in the list is identified by a string following the <github org>/<repository name>@<release name> format. When encountering such a plugin identifier, Halkyon will attempt to download a file found at https://github.com/<github org>/<repository name>/releases/download/<release name>/halkyon_plugin_<target OS>.tar.gz, where <target OS> corresponds to the value reported by the Go runtime as runtime.GOOS in the running operator. A good way to make sure that your plugin is downloadable by Halkyon is to use GoReleaser combined with GitHub Actions. See https://github.com/halkyonio/kubedb-capability for more details.
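
The identifier convention can be sketched as follows. This is not Halkyon's actual code: PluginID, ParsePluginID and ParsePluginList are hypothetical names, the release names in the example are made up, and the URL assumes the usual GitHub release asset layout of https://github.com/<org>/<repository>/releases/download/<release>/<file>.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// PluginID is the parsed form of "<github org>/<repository name>@<release name>".
type PluginID struct{ Org, Repo, Release string }

// ParsePluginID splits a single identifier into its three parts.
func ParsePluginID(id string) (PluginID, error) {
	parts := strings.SplitN(strings.TrimSpace(id), "@", 2)
	if len(parts) != 2 || parts[1] == "" {
		return PluginID{}, fmt.Errorf("missing release name in %q", id)
	}
	orgRepo := strings.SplitN(parts[0], "/", 2)
	if len(orgRepo) != 2 || orgRepo[0] == "" || orgRepo[1] == "" {
		return PluginID{}, fmt.Errorf("expected <org>/<repository> in %q", id)
	}
	return PluginID{Org: orgRepo[0], Repo: orgRepo[1], Release: parts[1]}, nil
}

// ParsePluginList parses a comma-separated list such as the value of HALKYON_PLUGINS.
func ParsePluginList(list string) ([]PluginID, error) {
	var ids []PluginID
	for _, item := range strings.Split(list, ",") {
		if strings.TrimSpace(item) == "" {
			continue
		}
		id, err := ParsePluginID(item)
		if err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, nil
}

// DownloadURL builds the archive location for the given target OS.
func (p PluginID) DownloadURL(goos string) string {
	return fmt.Sprintf("https://github.com/%s/%s/releases/download/%s/halkyon_plugin_%s.tar.gz",
		p.Org, p.Repo, p.Release, goos)
}

func main() {
	// "v1" is a made-up release name used only for illustration.
	ids, err := ParsePluginList("halkyonio/kubedb-capability@v1")
	if err != nil {
		panic(err)
	}
	for _, id := range ids {
		fmt.Println(id.DownloadURL(runtime.GOOS))
	}
}
```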