ironcore-dev/metal-operator

Support Ad-hoc Operations for Server Management

Closed this issue · 8 comments

Summary:
We need to extend our declarative model with resources like Server and ServerClaim to support ad-hoc operations such as reboot or power cycle of a server. This enhancement aims to address how we can incorporate imperative operations within the declarative Kubernetes API model.

Background:
Currently, our API allows managing bare metal servers using Server and ServerClaim resources. However, we face challenges in supporting ad-hoc operations, which are inherently imperative and contradict the declarative nature of the Kubernetes API model.

Proposed Solutions:

  1. Annotations-based Approach:

    • Allow ad-hoc operations by adding annotations to the Server resource.
    • Example: metal.ironcore.dev/operations: PowerCycle.
    • The reconciler will check for the presence of this annotation and perform the corresponding operation, such as power cycling the server.

    Pros:

    • Simple to implement and integrate with existing CRD-based models.
    • Minimal changes required to the existing API structure.

    Cons:

    • Annotations are not a first-class mechanism for defining operations and might lead to a less discoverable and less manageable API.
    • Handling multiple concurrent operations or complex workflows might become cumbersome.

  2. Aggregated API Server with SubResources:

    • Transition from a CRD-based model to an aggregated API server.
    • Define custom subresources like PowerCycle and Reboot for the Server resource.
    • Example: POST /apis/metal.ironcore.dev/v1/namespaces/{namespace}/servers/{name}/powercycle.

    Pros:

    • Provides a more RESTful and discoverable way to define and manage imperative operations.
    • Subresources can encapsulate complex logic and workflows better.

    Cons:

    • Requires significant refactoring to migrate to an aggregated API server model.
    • Increased complexity in terms of deployment and maintenance.
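
To make option 1 more concrete, here is a minimal sketch of how a reconciler might consume the annotation exactly once. Only the annotation key `metal.ironcore.dev/operations` and the operation names `PowerCycle` and `Reboot` come from the proposal; the `Server` struct, the `BMC` interface, and `HandleOperation` are hypothetical stand-ins and not the operator's actual types. Removing the annotation after success is an assumed design choice, so that a later reconcile does not repeat the operation.

```go
package main

import (
	"errors"
	"fmt"
)

// OperationAnnotation is the annotation key used in the proposal.
const OperationAnnotation = "metal.ironcore.dev/operations"

// Server is a simplified, hypothetical stand-in for the Server resource.
type Server struct {
	Name        string
	Annotations map[string]string
}

// BMC is a hypothetical interface to the server's management controller.
type BMC interface {
	PowerCycle(serverName string) error
	Reboot(serverName string) error
}

// ErrUnknownOperation marks annotation values the handler does not know.
var ErrUnknownOperation = errors.New("unknown operation")

// HandleOperation executes the operation requested via the annotation and
// removes the annotation on success. It reports whether an operation ran.
// On error the annotation is left in place so the failure stays visible.
func HandleOperation(srv *Server, bmc BMC) (bool, error) {
	op, ok := srv.Annotations[OperationAnnotation]
	if !ok {
		return false, nil // nothing requested
	}
	var err error
	switch op {
	case "PowerCycle":
		err = bmc.PowerCycle(srv.Name)
	case "Reboot":
		err = bmc.Reboot(srv.Name)
	default:
		err = fmt.Errorf("%w: %q", ErrUnknownOperation, op)
	}
	if err != nil {
		return false, err
	}
	delete(srv.Annotations, OperationAnnotation)
	return true, nil
}

// fakeBMC records the calls it receives instead of talking to hardware.
type fakeBMC struct{ calls []string }

func (f *fakeBMC) PowerCycle(name string) error {
	f.calls = append(f.calls, "PowerCycle:"+name)
	return nil
}

func (f *fakeBMC) Reboot(name string) error {
	f.calls = append(f.calls, "Reboot:"+name)
	return nil
}

func main() {
	srv := &Server{
		Name:        "server-1",
		Annotations: map[string]string{OperationAnnotation: "PowerCycle"},
	}
	bmc := &fakeBMC{}

	// First pass executes the operation and clears the annotation.
	done, err := HandleOperation(srv, bmc)
	fmt.Println(done, err, bmc.calls)

	// Second pass finds no annotation and does nothing.
	done, err = HandleOperation(srv, bmc)
	fmt.Println(done, err, bmc.calls)
}
```

In a real controller the annotation removal would be a patch against the API server, and a failed patch after a successful power cycle could cause the operation to run twice; the sketch does not address that.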

Request for Comments:
We seek feedback on the following points:

  • Which approach (Annotations or Aggregated API Server) is more suitable for our use case?
  • Any potential challenges or alternatives that we should consider.
  • Best practices for implementing imperative operations in a declarative system.

Next Steps:
Based on the feedback, we will:

  • Finalize the approach for implementing ad-hoc operations.
  • Create a detailed implementation plan.
  • Assign tasks and start the development process.

Additional Context:

  • Link to relevant documentation or previous discussions.
  • Examples of similar implementations in other projects (if any).

Please provide your feedback and suggestions to help us move forward with this enhancement.

For reference, metal3's annotation design allows multiple controllers to set the power state independently: https://book.metal3.io/bmo/reboot_annotation#phased-reboot
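
As described on the linked page, metal3's phased reboot uses annotations of the form `reboot.metal3.io/{key}`, and the host stays powered off while any such annotation is present. A rough sketch of how that idea could carry over to a Server resource follows. The prefix `poweroff.metal.ironcore.dev/` and the function names are purely hypothetical and are not part of the metal-operator API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// PowerOffPrefix is a hypothetical annotation prefix. Each controller
// appends its own key, so controllers never overwrite each other's request.
const PowerOffPrefix = "poweroff.metal.ironcore.dev/"

// PowerOffHolders returns the sorted keys of all controllers that
// currently ask for the server to stay powered off.
func PowerOffHolders(annotations map[string]string) []string {
	var holders []string
	for k := range annotations {
		if strings.HasPrefix(k, PowerOffPrefix) {
			key := strings.TrimPrefix(k, PowerOffPrefix)
			if key != "" {
				holders = append(holders, key)
			}
		}
	}
	sort.Strings(holders)
	return holders
}

// DesiredPowerOn is true only when no controller holds a power-off request.
func DesiredPowerOn(annotations map[string]string) bool {
	return len(PowerOffHolders(annotations)) == 0
}

func main() {
	annotations := map[string]string{
		PowerOffPrefix + "firmware-updater": "",
		PowerOffPrefix + "hardware-audit":   "",
	}
	fmt.Println(PowerOffHolders(annotations), DesiredPowerOn(annotations))

	// One controller finishes; the server must stay off for the other.
	delete(annotations, PowerOffPrefix+"firmware-updater")
	fmt.Println(PowerOffHolders(annotations), DesiredPowerOn(annotations))

	// The last holder is removed; the server may be powered on again.
	delete(annotations, PowerOffPrefix+"hardware-audit")
	fmt.Println(PowerOffHolders(annotations), DesiredPowerOn(annotations))
}
```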

I would vote for option 1

Then I would suggest that we proceed with option one.

Do we have a list of operations we want to support?

At least for reboot and power cycle there seems to be some overlap with #76, which would be a declarative solution. I don't expect that implementation to arrive soon. In the meantime, going with an annotation is fine, because it's not part of the API contract and can be removed with ease.

@afritzler did I understand ad-hoc correctly that those operations should be executed early in the reconcile loop, no matter the server state and spec, or do we want to restrict them in some way?

The question is: do we want to PowerCycle if the Server is in PowerState == Off? I would suggest that we only do a reboot if the Server is On. If it is powered off, it is a no-op.
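
The rule suggested here (execute only when the server is on, otherwise treat the request as a no-op) could be expressed as a small guard. This is a sketch only: `PowerState`, its constants, and `ShouldExecute` are illustrative names and are not taken from the operator's code.

```go
package main

import "fmt"

// PowerState is a hypothetical representation of the server's power state.
type PowerState string

const (
	PowerStateOn  PowerState = "On"
	PowerStateOff PowerState = "Off"
)

// ShouldExecute reports whether an ad-hoc operation should be sent to the
// BMC. PowerCycle and Reboot only make sense for a powered-on server; for
// any other state, or an unknown operation, the request is a no-op.
func ShouldExecute(op string, state PowerState) bool {
	switch op {
	case "PowerCycle", "Reboot":
		return state == PowerStateOn
	default:
		return false
	}
}

func main() {
	for _, state := range []PowerState{PowerStateOn, PowerStateOff} {
		fmt.Printf("PowerCycle while %s: execute=%t\n",
			state, ShouldExecute("PowerCycle", state))
	}
}
```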

I would hope that the server's BMC would then just ignore it in this case. If not, maybe it was something the end user wanted to do?

For me the more important question is:
When do we execute those operations? If someone sets operation=PowerCycle, should the operation be executed immediately, ignoring anything else that is currently happening in the reconcile loop?