canonical/training-operator

Update `training-operator` manifests

Closed this issue · 10 comments

Context

Each charm has a set of manifest files that have to be upgraded to their target version. The process of upgrading manifest files usually means going to the component’s upstream repository, comparing the charm’s manifest against the one in the repository and adding the missing bits in the charm’s manifest.

What needs to get done

https://docs.google.com/document/d/1a4obWw98U_Ndx-ZKRoojLf4Cym8tFb_2S7dq5dtRQqs/edit?pli=1#heading=h.jt5e3qx0jypg

Definition of Done

  1. Manifests are updated
  2. Upstream image is used

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5692.

This message was autogenerated

The training-operator recently introduced ValidatingWebhookConfigurations (kubeflow/training-operator#1993) for the various training resources this controller manages. Under the hood, the training-operator controller uses cert-controller for managing self-signed certificates that can be used in the webhook configuration.
The cert-controller component is part of the training-operator's business logic; in fact, any error while managing the certificates can cause the service to exit (see here).

It is important to note that the cert-controller will use a Secret as its source of truth for managing the certificates.

Based on the above information, the following were my approaches for adding support for these webhooks:

  1. Using the known cert.py script to generate self-signed certs and render them into the webhooks is not an option, because the workload controller expects the secret contents to be located in a specific volume inside the training-operator workload container; otherwise it times out waiting for the files. This line gets executed and the training-operator service never starts correctly; moreover, the webhook service never starts.
  2. The training-operator Deployment (upstream) explicitly mounts the Secret in a Volume mountPath, which is used by the cert-controller. Because of that, we can patch the training-operator StatefulSet and mount the secret in the expected volumeMount.
  3. Manually placing the contents of the Secret in /tmp/k8s-webhook-server/serving-certs does not help either. At first glance, the files are there, but the service does not seem to use them correctly.

I have tried all three approaches, but none of them works: neither the workload service nor the webhook server starts correctly. I will have to do more testing on this and report back my findings.

Using pebble to push the Secret contents

I have tried this method, but the workload service does not seem to start correctly. I'm doing the following:

# charm.py

def _upload_secret(self):
    # --- Method 1: place the Secret contents in the directory the workload expects
    import base64
    from lightkube.resources.core_v1 import Secret

    webhook_secret = self.k8s_resource_handler.lightkube_client.get(
        Secret, name="training-operator-webhook-cert", namespace="kubeflow"
    )
    for filename in ("ca.crt", "ca.key", "tls.crt", "tls.key"):
        self.container.push(
            path=f"/tmp/k8s-webhook-server/serving-certs/{filename}",
            # Secret .data values are base64-encoded; decode before pushing
            source=base64.b64decode(webhook_secret.data[filename]),
            make_dirs=True,
            permissions=0o644,  # 420 decimal (== 0o644) is the usual Secret volume defaultMode
        )

_upload_secret() is called during the initial hooks.

The problem I have found with this is that while the workload service (cert-rotation) is able to find the certificates in the right path, it times out looking for data in some sort of cache:

2024-05-29T13:02:59.670Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   certs are ready in /tmp/k8s-webhook-server/serving-certs
2024-05-29T13:02:59.769Z [training-operator] 2024-05-29T13:02:59Z       INFO    Starting workers        {"controller": "cert-rotator", "worker count": 1}
2024-05-29T13:02:59.770Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   no cert refresh needed
2024-05-29T13:02:59.770Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   Ensuring CA cert        {"name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-29T13:02:59.775Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   no cert refresh needed
2024-05-29T13:02:59.775Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   Ensuring CA cert        {"name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z       INFO    cert-rotation   CA certs are injected to webhooks
2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z       INFO    setup   Certs ready
2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z       INFO    setup   registering controllers...
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z       INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z       INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z       INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z       INFO    Starting Controller     {"controller": "tfjob-controller"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    controller-runtime.builder      skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called       {"GVK": "kubeflow.org/v1, Kind=TFJob"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    controller-runtime.builder      Registering a validating webhook        {"GVK": "kubeflow.org/v1, Kind=TFJob", "path": "/validate-kubeflow-org-v1-tfjob"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    controller-runtime.webhook      Registering webhook     {"path": "/validate-kubeflow-org-v1-tfjob"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    controller-runtime.webhook      Starting webhook server
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    Stopping and waiting for non leader election runnables
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "cert-rotator"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    cert-rotation   stopping cert rotator controller
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    All workers finished    {"controller": "cert-rotator"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    Stopping and waiting for leader election runnables
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       ERROR   controller-runtime.source.EventHandler  failed to get informer from cache       {"error": "Timeout: failed waiting for *v1.Service Informer to sync"}

This could be because the secret data is only pushed to the workload container once in the charm's lifetime. To solve this, we could add a Pebble notice in the workload container that compares the uploaded data against the actual data in the Secret, though I am not sure this diagnosis is entirely correct. This approach has the following issues:

  1. The training-operator image is distroless - we would have to create a rock for it
  2. We'll be introducing a change that most likely we'll have to remove in the next version of the training-operator, as upstream is planning to use a different certificate generation method for the next release
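As a rough sketch of the notice-based refresh idea: the interesting part is comparing what was last pushed against the Secret's current contents. Everything below is hypothetical (helper names, the commented-out handler); it only illustrates the comparison and is not the charm's actual code:

```python
import base64

# Filenames cert-controller expects under /tmp/k8s-webhook-server/serving-certs
CERT_FILES = ("ca.crt", "ca.key", "tls.crt", "tls.key")


def decode_secret_data(data: dict) -> dict:
    """Decode the base64-encoded .data fields of a Kubernetes Secret to bytes."""
    return {k: base64.b64decode(v) for k, v in data.items() if k in CERT_FILES}


def certs_changed(pushed: dict, current: dict) -> bool:
    """Return True if any expected cert file differs from what was last pushed."""
    return any(pushed.get(name) != current.get(name) for name in CERT_FILES)


# In the charm, a (hypothetical) custom-notice handler would re-push on change:
#
# def _on_pebble_custom_notice(self, event):
#     secret = self.k8s_resource_handler.lightkube_client.get(
#         Secret, name="training-operator-webhook-cert", namespace="kubeflow")
#     current = decode_secret_data(secret.data)
#     if certs_changed(self._last_pushed, current):
#         for name, content in current.items():
#             self.container.push(
#                 f"/tmp/k8s-webhook-server/serving-certs/{name}",
#                 source=content, make_dirs=True)
#         self._last_pushed = current
```

Even with this in place, the fundamental problem from the list above remains: something has to emit the notice when the Secret changes, and we don't control when that happens.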

Patching the StatefulSet

I have tried patching the StatefulSet from within the charm, and this worked. The major problem with this is that patching StatefulSets in juju is a dangerous activity that can have unintended consequences, plus the juju agent may overwrite everything on restart.
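For reference, the patch I experimented with was along these lines. This is a sketch only: the resource and secret names are taken from the manifests discussed above, and, as noted, the juju agent may revert the patch at any time:

```python
def build_cert_mount_patch(secret_name: str, container_name: str) -> dict:
    """Build a strategic-merge patch that adds the webhook Secret as a volume
    and mounts it where cert-controller expects the serving certs."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "volumes": [{
                        "name": "cert",
                        # 420 decimal == 0o644, the usual defaultMode for Secret volumes
                        "secret": {"secretName": secret_name, "defaultMode": 420},
                    }],
                    "containers": [{
                        "name": container_name,
                        "volumeMounts": [{
                            "name": "cert",
                            "readOnly": True,
                            "mountPath": "/tmp/k8s-webhook-server/serving-certs",
                        }],
                    }],
                }
            }
        }
    }


# Applying it with lightkube (requires cluster access, so shown as a comment):
# from lightkube import Client
# from lightkube.resources.apps_v1 import StatefulSet
# Client().patch(StatefulSet, "training-operator",
#                build_cert_mount_patch("training-operator-webhook-cert",
#                                       "training-operator"),
#                namespace="kubeflow")
```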

Deploying the training-operator as a Deployment

There is a third option for solving this, which is to deploy the training-operator as a Deployment, similar to what we do for the istio-gateway charm. This would allow us to mount the secret without interfering with juju operations at all.
The only problem is that, yet again, the upstream project will change the certificate creation approach for the next release, meaning that we'd change this charm for one version and refactor it again for the next one.

2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       ERROR   controller-runtime.source.EventHandler  failed to get informer from cache       {"error": "Timeout: failed waiting for *v1.Service Informer to sync"}

This error line says that the Informer of the controller-runtime Go package timed out waiting to sync. From what I know, the Informer is the Go component that effectively does a "kubectl watch" to observe events for objects, puts them in a cache of the controller's K8s object state, and then triggers the reconciliation loop.
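To make that description concrete, here is a toy model of the Informer's cache-keeping. This is purely illustrative Python; the real Informer lives in controller-runtime's Go code:

```python
# Toy model of an informer: apply watch events to a local cache that the
# reconciliation loop reads from. Purely illustrative, not controller-runtime.
def apply_watch_event(cache: dict, op: str, obj: dict) -> dict:
    """Apply one watch event (ADDED/MODIFIED/DELETED) to the local cache."""
    key = (obj["metadata"]["namespace"], obj["metadata"]["name"])
    if op in ("ADDED", "MODIFIED"):
        cache[key] = obj
    elif op == "DELETED":
        cache.pop(key, None)
    return cache


# Before starting workers, the controller waits for the cache to "sync", i.e.
# to receive the initial LIST of objects; the error in the log above means
# that initial sync never completed for *v1.Service within the timeout.
```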

At a quick glance, the mention of *v1.Service seems odd. Does it mean the Informer is failing to watch K8s Services? Or is this something else?

A quick search shows that K8s controllers fail this way for other people on objects like *v1.Deployment, so the hunch that this is about watching K8s Services might be right.

Maybe there's a "networking issue" when talking to the K8s API?
https://community.dynatrace.com/t5/Container-platforms/Kubernetes-operator-fails-to-start/m-p/224724

Deploying the training-operator as a Deployment

#161 and #162 show the required changes for this to happen. While the charm has to be refactored, the reality is that this is the easiest path and the one suggested by the juju team.

I have tested this approach with training jobs, and even the upgrade path, and neither presents any issue. The only minor concern is that there will be a bit of downtime during upgrades because the training-operator Pod has to be restarted; other than that, I did not find any issues.

Using pebble to push the Secret contents

@kimwnasptd thanks for the pointers, I need to dig deeper on this one. I wanted to take a first pass at the solution suggested by the juju team (the one above) so we at least have something ready. I need to allocate a bit more time to understand what could be causing the cache issue we are seeing.

I took a very quick look at the code, as the message about the informers caught my attention.

My hunch is that this doesn't have anything to do with the certificates. Specifically, this is the code execution path I see, based on the logs above:

  1. The code runs the certificates logic and this succeeds
    2024-05-29T13:02:59.670Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   certs are ready in /tmp/k8s-webhook-server/serving-certs
    2024-05-29T13:02:59.769Z [training-operator] 2024-05-29T13:02:59Z       INFO    Starting workers        {"controller": "cert-rotator", "worker count": 1}
    2024-05-29T13:02:59.770Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   no cert refresh needed
    2024-05-29T13:02:59.770Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   Ensuring CA cert        {"name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
    2024-05-29T13:02:59.775Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   no cert refresh needed
    2024-05-29T13:02:59.775Z [training-operator] 2024-05-29T13:02:59Z       INFO    cert-rotation   Ensuring CA cert        {"name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
    2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z       INFO    cert-rotation   CA certs are injected to webhooks
    2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z       INFO    setup   Certs ready
    
    https://github.com/kubeflow/training-operator/blob/v1.8-branch/cmd/training-operator.v1/main.go#L175-L177
  2. The code proceeds to set up a Controller for each supported CRD
    https://github.com/kubeflow/training-operator/blob/v1.8-branch/cmd/training-operator.v1/main.go#L196-L208
  3. The first CRD, from supported schemes, is TFJob
    https://github.com/kubeflow/training-operator/blob/v1.8-branch/pkg/controller.v1/register_controller.go#L37-L56
  4. Looking at the setup of the TensorFlow controller, we see it sets up a watch for TFJobs, Pods, and Services
    https://github.com/kubeflow/training-operator/blob/v1.8-branch/pkg/controller.v1/tensorflow/tfjob_controller.go#L198-L203
    2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z       INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
    2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z       INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
    2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z       INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
    2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z       INFO    Starting Controller     {"controller": "tfjob-controller"}
    
  5. The code proceeds to set up webhooks for this CRD (this one has only validating webhooks)
    https://github.com/kubeflow/training-operator/blob/v1.8-branch/cmd/training-operator.v1/main.go#L210-L219
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    controller-runtime.builder      skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called       {"GVK": "kubeflow.org/v1, Kind=TFJob"}
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    controller-runtime.builder      Registering a validating webhook        {"GVK": "kubeflow.org/v1, Kind=TFJob", "path": "/validate-kubeflow-org-v1-tfjob"}
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    controller-runtime.webhook      Registering webhook     {"path": "/validate-kubeflow-org-v1-tfjob"}
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    controller-runtime.webhook      Starting webhook server
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    Stopping and waiting for non leader election runnables
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "cert-rotator"}
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    cert-rotation   stopping cert rotator controller
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    All workers finished    {"controller": "cert-rotator"}
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       INFO    Stopping and waiting for leader election runnables
    
  6. The controller crashes because the watch for K8s Services (Informer) times out
    2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z       ERROR   controller-runtime.source.EventHandler  failed to get informer from cache       {"error": "Timeout: failed waiting for *v1.Service Informer to sync"}
    

So from the above, I understand that this doesn't have anything to do with the webhooks, but rather with the controller's ability to watch K8s Services.

@DnPlas Slightly unrelated to the previous comments here, but in the "how to update manifests" section of contributing.md we should also mention updating the pytorchjob_image in the training-operator UAT on new releases.

Summary

  1. juju does not provide any means of mounting Kubernetes Secrets the way the training-operator and cert-controller controllers expect. When a Secret is mounted as a Kubernetes volume, Kubernetes itself ensures the Secret contents (as files) are always up to date. With the tools we have in the juju ecosystem we can push the Kubernetes Secret contents into the container, but keeping them up to date can be "awkward".
  2. Using pebble and pebble notices to wake up the charm every time the Kubernetes Secret contents change could solve the above, but since we do not know when the Kubernetes Secret will change, the notice may arrive too late for the charm to push the updated contents, potentially affecting the controller.
  3. Even pushing the Kubernetes Secret contents did not help, because of the error exposed here.
  4. The juju team has suggested against patching the StatefulSet to add the Secret mount.
  5. The juju team, while saying it is far from ideal, has mentioned we could deploy the workload using a Deployment or StatefulSet managed by the charm, similar to what we do for the istio-gateway.

Conclusion

Based on the above, and because of the issues we saw in the past, we have concluded that for now it is better (and easier) to deploy the workload using a Deployment that we can edit and modify whenever required. In the future, when the training-operator project moves away from using the cert-controller, we should refactor this charm to make it a sidecar again.
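As a sketch of what the charm-managed Deployment looks like (names and labels here are assumed from the upstream manifests; the real charm renders this through its resource handler templates):

```python
def build_deployment_manifest(image: str, namespace: str = "kubeflow") -> dict:
    """Minimal Deployment manifest for the training-operator workload, with the
    webhook Secret mounted the way cert-controller expects. Labels and names
    are illustrative, following the upstream manifests."""
    labels = {"control-plane": "kubeflow-training-operator"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "training-operator", "namespace": namespace},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "training-operator",
                        "image": image,
                        "volumeMounts": [{
                            "name": "cert",
                            "readOnly": True,
                            "mountPath": "/tmp/k8s-webhook-server/serving-certs",
                        }],
                    }],
                    "volumes": [{
                        "name": "cert",
                        "secret": {
                            "secretName": "training-operator-webhook-cert",
                            "defaultMode": 420,  # 0o644
                        },
                    }],
                },
            },
        },
    }
```

The key benefit is the Secret volume: Kubernetes itself keeps its contents in sync, so no charm-side refresh logic is needed.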

Upgrade path

Please note that because of https://bugs.launchpad.net/juju/+bug/1991955, upgrading training-operator 1.7 -> 1.8 will result in the charm Pod having two containers instead of just one (as we refactored the charm).

While the issue has been filed in juju, we are not certain this will get fixed soon, so as workarounds we know that:

  • We could manually edit the charm's StatefulSet to get rid of the unwanted container
  • We could remove the application entirely and re-deploy it - less desirable
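The manual edit in the first workaround boils down to removing the stale container entry from the StatefulSet's pod spec (e.g. via `kubectl edit statefulset training-operator -n kubeflow`). As a sketch of that transformation (hypothetical helper; note the agent may re-add the container on a later refresh):

```python
def drop_container(pod_spec: dict, name: str) -> dict:
    """Return a copy of a pod spec without the named container.

    Sketch of the manual StatefulSet edit that removes the leftover workload
    container after the 1.7 -> 1.8 refactor; hypothetical helper name.
    """
    spec = dict(pod_spec)
    spec["containers"] = [
        c for c in pod_spec.get("containers", []) if c.get("name") != name
    ]
    return spec
```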