Update `training-operator` manifests
Context
Each charm has a set of manifest files that have to be upgraded to their target version. The process of upgrading manifest files usually means going to the component’s upstream repository, comparing the charm’s manifest against the one in the repository and adding the missing bits in the charm’s manifest.
What needs to get done
Definition of Done
- Manifests are updated
- Upstream image is used
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5692.
This message was autogenerated
The training-operator recently introduced `ValidatingWebhookConfiguration`s (kubeflow/training-operator#1993) for the various training resources this controller manages. Under the hood, the training-operator controller uses cert-controller to manage the self-signed certificates used in the webhook configuration.
The cert-controller component is part of the business logic of the training-operator; in fact, any error while managing the certificates can cause the service to exit (see here).
It is important to note that cert-controller uses a `Secret` as its source of truth for managing the certificates.
Based on the above information, the following were my approaches for adding support for these webhooks:
- Using the known cert.py script to generate self-signed certs and then render the webhooks is not an option, because the actual workload controller expects the secret contents to be located in a specific volume in the training-operator workload container; otherwise it ends up timing out waiting for the files. This line gets executed, the training-operator service never starts correctly, and moreover the webhook service never starts.
- The training-operator `Deployment` (upstream) explicitly mounts the `Secret` at a volume `mountPath`, which is used by cert-controller. Because of that, we can patch the training-operator `StatefulSet` and mount the secret in the expected `volumeMount`.
- Manually placing the contents of the `Secret` in `/tmp/k8s-webhook-server/serving-certs` does not help either. At first glance, it looks like the files are there, but the service does not seem to use them correctly.

Although I have tried all three approaches, none of them works correctly: the service (both the workload and the webhook) does not start correctly. I will have to do more testing on this and report back my findings.
Using pebble to push the `Secret` contents
I have tried this method, but the workload service does not seem to start correctly. I'm doing the following:
```python
# charm.py
def _upload_secret(self):
    # --- Method 1: place the Secret contents in the directory the
    # cert-controller expects. Note that Secret.data values are
    # base64-encoded by the Kubernetes API, so they are decoded here
    # before being pushed into the container.
    import base64

    from lightkube.resources.core_v1 import Secret

    webhook_secret = self.k8s_resource_handler.lightkube_client.get(
        Secret, name="training-operator-webhook-cert", namespace="kubeflow"
    )
    for filename in ("ca.crt", "ca.key", "tls.crt", "tls.key"):
        self.container.push(
            path=f"/tmp/k8s-webhook-server/serving-certs/{filename}",
            source=base64.b64decode(webhook_secret.data[filename]),
            make_dirs=True,
            # 0o420 in the original snippet is likely a typo for 0o644
            # (decimal 420 == octal 644, a common defaultMode mix-up)
            permissions=0o644,
        )
```
`_upload_secret()` is called during the initial hooks.
The problem I have found with this is that while the workload service (`cert-rotation`) is able to find the certificates in the right path, it times out looking for data in some sort of cache:
```
2024-05-29T13:02:59.670Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation certs are ready in /tmp/k8s-webhook-server/serving-certs
2024-05-29T13:02:59.769Z [training-operator] 2024-05-29T13:02:59Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
2024-05-29T13:02:59.770Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation no cert refresh needed
2024-05-29T13:02:59.770Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation Ensuring CA cert {"name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-29T13:02:59.775Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation no cert refresh needed
2024-05-29T13:02:59.775Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation Ensuring CA cert {"name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z INFO cert-rotation CA certs are injected to webhooks
2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z INFO setup Certs ready
2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z INFO setup registering controllers...
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z INFO Starting Controller {"controller": "tfjob-controller"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO controller-runtime.builder skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called {"GVK": "kubeflow.org/v1, Kind=TFJob"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "kubeflow.org/v1, Kind=TFJob", "path": "/validate-kubeflow-org-v1-tfjob"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO controller-runtime.webhook Registering webhook {"path": "/validate-kubeflow-org-v1-tfjob"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO controller-runtime.webhook Starting webhook server
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO Stopping and waiting for non leader election runnables
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "cert-rotator"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO cert-rotation stopping cert rotator controller
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO All workers finished {"controller": "cert-rotator"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO Stopping and waiting for leader election runnables
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "Timeout: failed waiting for *v1.Service Informer to sync"}
```
This could be because the secret data is pushed to the workload container only once in the charm's lifetime. To solve this, we could add a pebble notice to the workload container that compares the state of the uploaded data with the actual data in the `Secret`, though I am not sure this statement is entirely true. This has the following issues:
- The training-operator image is distroless, so we would have to create a rock for it
- We would be introducing a change that we will most likely have to remove in the next version of the training-operator, as upstream plans to use a different certificate-generation method for the next release
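The staleness check such a notice handler would need can be sketched in plain Python. This is only an illustration, not the charm's actual code: the `fingerprint`/`needs_repush` helpers are hypothetical names, and a real charm would fetch the `Secret` via lightkube and push files via Pebble.

```python
import base64
import hashlib


def fingerprint(secret_data):
    """Stable digest of the decoded Secret contents.

    Kubernetes Secret.data values are base64-encoded strings, so we
    decode them before hashing to fingerprint the real file contents.
    """
    digest = hashlib.sha256()
    for key in sorted(secret_data):
        digest.update(key.encode())
        digest.update(base64.b64decode(secret_data[key]))
    return digest.hexdigest()


def needs_repush(secret_data, last_pushed_fingerprint):
    """True when the Secret changed since the charm last pushed it."""
    return last_pushed_fingerprint != fingerprint(secret_data)
```

On each notice (or update-status hook), the charm could recompute the fingerprint and re-push the certificate files only when `needs_repush` returns True, avoiding redundant writes into the workload container.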
Patching the `StatefulSet`
I have tried patching the `StatefulSet` from within the charm, and this worked. The major problem with this is that patching `StatefulSet`s in juju is a dangerous activity that can have unintended consequences; in addition, the juju agent may overwrite everything on restart.
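For reference, the patch in question boils down to adding a volume backed by the `Secret` plus the matching `volumeMount` to the pod template. The sketch below only builds the strategic-merge patch body (which a charm could apply, e.g. with lightkube's `Client.patch`); the container and secret names are assumptions, not necessarily the charm's actual values.

```python
def webhook_cert_patch(container_name="training-operator",
                       secret_name="training-operator-webhook-cert"):
    """Build a strategic-merge patch body mounting the webhook cert Secret.

    Hypothetical helper: names default to the values used elsewhere in
    this issue. defaultMode 420 is octal 0o644, as commonly set in
    upstream manifests.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "volumes": [
                        {
                            "name": "cert",
                            "secret": {"secretName": secret_name,
                                       "defaultMode": 420},
                        }
                    ],
                    "containers": [
                        {
                            "name": container_name,
                            "volumeMounts": [
                                {
                                    "name": "cert",
                                    "readOnly": True,
                                    "mountPath": "/tmp/k8s-webhook-server/serving-certs",
                                }
                            ],
                        }
                    ],
                }
            }
        }
    }
```

The appeal of this shape is that Kubernetes then keeps the mounted files in sync with the `Secret` for us; the drawback, as noted above, is that the juju agent owns the `StatefulSet` and may revert the patch.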
Deploying the training-operator as a `Deployment`
There is a third option for solving this: deploy the training-operator as a `Deployment`, similar to what we do for the istio-gateway charm. This would allow us to mount the secret without interfering with juju operations at all.
The only problem with this is that, yet again, the upstream project will change the certificate-creation approach for the next release, meaning that we would change this charm for one version and refactor it again for the next one.
```
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "Timeout: failed waiting for *v1.Service Informer to sync"}
```
This error line says that an Informer of the `controller-runtime` Go package timed out waiting to sync. From what I know, the informer is the Go component that does the equivalent of `kubectl watch`: it observes events for objects, puts them in a cache of the controller's K8s object state, and then the reconciliation loop is triggered.
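The watch-cache-reconcile cycle described above can be modelled with a toy sketch (in Python rather than Go, and heavily simplified compared to the real controller-runtime informer, which also performs an initial list and must sync its cache before workers start, which is the step timing out here):

```python
class ToyInformer:
    """Toy model of an informer: consume watch events, keep a local cache
    of object state, and queue object keys for the reconciliation loop."""

    def __init__(self):
        self.cache = {}        # local store of K8s object state
        self.work_queue = []   # keys handed to the reconcile loop

    def handle_event(self, event_type, name, obj=None):
        # Mirror the watch event into the local cache...
        if event_type in ("ADDED", "MODIFIED"):
            self.cache[name] = obj
        elif event_type == "DELETED":
            self.cache.pop(name, None)
        # ...and trigger reconciliation for the affected object.
        self.work_queue.append(name)
```

The reconcile workers then read only from the cache, which is why controller-runtime refuses to start them until the cache has synced with the API server.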
At a quick glance, the mention of `*v1.Service` seems odd. Does it mean the Informer is failing to watch K8s Services? Or is this something else?
Some quick googling shows that K8s controllers fail for other people on objects like `*v1.Deployment`, so the hunch might be right that this is about watching K8s Services.
Maybe there is then a "networking issue" when talking to the K8s API?
https://community.dynatrace.com/t5/Container-platforms/Kubernetes-operator-fails-to-start/m-p/224724
Deploying the training-operator as a `Deployment`
#161 and #162 show the required changes for this to happen. While the charm has to be refactored, this is in reality the easiest path and the one suggested by the juju team.
I have tested this approach with training jobs, and even the upgrade path, and it does not present any issues. The only minor concern is that there will be a bit of downtime between upgrades because the training-operator Pod has to be restarted; other than that, I did not really find any issues.
Using pebble to push the `Secret` contents
@kimwnasptd thanks for the pointers, I need to dig deeper on this one. I wanted to first attempt the solution suggested by the juju team (the one above) to at least have something ready. I need to allocate a bit more time to understand what could be causing the cache issue we are seeing.
I took a very quick look at the code, as the message about the informers caught my attention.
My hunch is that this shouldn't have anything to do with the certificates. Specifically, this is the code execution path I see, based on the logs above:
- The code runs the certificates logic, and this succeeds:
https://github.com/kubeflow/training-operator/blob/v1.8-branch/cmd/training-operator.v1/main.go#L175-L177

```
2024-05-29T13:02:59.670Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation certs are ready in /tmp/k8s-webhook-server/serving-certs
2024-05-29T13:02:59.769Z [training-operator] 2024-05-29T13:02:59Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
2024-05-29T13:02:59.770Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation no cert refresh needed
2024-05-29T13:02:59.770Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation Ensuring CA cert {"name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-29T13:02:59.775Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation no cert refresh needed
2024-05-29T13:02:59.775Z [training-operator] 2024-05-29T13:02:59Z INFO cert-rotation Ensuring CA cert {"name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "validator.training-operator.kubeflow.org", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z INFO cert-rotation CA certs are injected to webhooks
2024-05-29T13:03:01.301Z [training-operator] 2024-05-29T13:03:01Z INFO setup Certs ready
```

- The code proceeds to set up a Controller for each supported CRD:
https://github.com/kubeflow/training-operator/blob/v1.8-branch/cmd/training-operator.v1/main.go#L196-L208
- The first CRD, from the supported schemes, is TFJob:
https://github.com/kubeflow/training-operator/blob/v1.8-branch/pkg/controller.v1/register_controller.go#L37-L56
- Looking at the setup of the tensorflow controller, we see it sets up a watch for TFJobs, Pods and Services:
https://github.com/kubeflow/training-operator/blob/v1.8-branch/pkg/controller.v1/tensorflow/tfjob_controller.go#L198-L203

```
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
2024-05-29T13:03:01.302Z [training-operator] 2024-05-29T13:03:01Z INFO Starting Controller {"controller": "tfjob-controller"}
```

- It then proceeds to set up the webhooks for this CRD (this one has only validating webhooks):
https://github.com/kubeflow/training-operator/blob/v1.8-branch/cmd/training-operator.v1/main.go#L210-L219

```
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO controller-runtime.builder skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called {"GVK": "kubeflow.org/v1, Kind=TFJob"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "kubeflow.org/v1, Kind=TFJob", "path": "/validate-kubeflow-org-v1-tfjob"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO controller-runtime.webhook Registering webhook {"path": "/validate-kubeflow-org-v1-tfjob"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO controller-runtime.webhook Starting webhook server
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO Stopping and waiting for non leader election runnables
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO Shutdown signal received, waiting for all workers to finish {"controller": "cert-rotator"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO cert-rotation stopping cert rotator controller
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO All workers finished {"controller": "cert-rotator"}
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z INFO Stopping and waiting for leader election runnables
```

- The controller crashes because the watch for K8s Services (the Informer) times out:

```
2024-05-29T13:03:01.303Z [training-operator] 2024-05-29T13:03:01Z ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "Timeout: failed waiting for *v1.Service Informer to sync"}
```

So from the above I understand that it doesn't have anything to do with the webhooks, but with something else about being able to watch K8s Services.
@DnPlas Slightly unrelated to the previous comments here, but in the "how to update manifests" section of contributing.md we should also mention updating the pytorchjob_image on new releases in the training-operator UAT.
Summary
- juju does not provide any means for mounting Kubernetes `Secret`s the way the `training-operator` and `cert-controller` controllers expect. When a `Secret` is mounted to a Kubernetes volume, Kubernetes itself makes sure the `Secret` contents (as files) are always up to date. With the tools we have in the juju ecosystem we would be able to push the Kubernetes `Secret` contents into the container, but keeping those contents up to date can be "awkward".
- Using pebble and pebble notices to wake up the charm every time the Kubernetes `Secret` contents change could solve the above, but since we do not know when the Kubernetes `Secret` is going to change, the notice may arrive too late for the charm to push the updated contents, potentially affecting the controller.
- Even pushing the Kubernetes `Secret` contents did not help, because of the error exposed here.
- The juju team has advised against patching the `StatefulSet` to add the `Secret` mount.
- The juju team, while saying it is far from ideal, has mentioned we could deploy the workload using Deployments or StatefulSets managed by the charm, similar to what we do for the istio-gateway.
Conclusion
Based on the above, and because of the issues we saw in the past, we have concluded that for now it is better (and easier) to deploy the workload using a `Deployment` that we can edit and modify whenever required. In the future, when the training-operator project moves away from using cert-controller, we should refactor this charm to make it a sidecar again.
Upgrade path
Please note that because of https://bugs.launchpad.net/juju/+bug/1991955, upgrading training-operator 1.7 -> 1.8 will result in the charm `Pod` having two containers instead of just one (as we refactored the charm).
While the issue has been filed with juju, we are not certain it will get fixed soon, so we know of the following workarounds:
- We could manually edit the charm's `StatefulSet` to get rid of the unwanted container
- We could remove the application entirely and re-deploy it (less desirable)