knative/serving

fix using KNative with ArgoCD (don't set `ownerReferences` for webhooks)

thesuperzapper opened this issue · 15 comments

Background

In 2021 KNative made it so that the ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources were "owned" by the Namespace which KNative is installed into (kanative-serving by default):

This was intended to ensure that users did not leave the webhooks when uninstalling (via deleting the Namespace) and break KNative when they re-installed (because the webhooks are cluster resources, and their backend would not exist and so would fail to validate anything).

Kubernetes has the concept of ownerReferences to indicate the relationships between resources. If an ownerReference sets blockOwnerDeletion, kubernetes will clean up these "child" resources before/after the "parent" resources is deleted (before: foreground delete, after: background delete).

For example, a Pod owned by a ReplicaSet might have the following ownerReferences:

apiVersion: v1
kind: Pod
metadata:
  name: xxxxxx
  namespace: xxxxxx
  ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: xxxxxx-759d8cb89
      uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Whats the problem?

This breaks ArgoCD, a very widely used GitOps system for Kubernetes.

Specifically, ArgoCD will never remove a resource that has ownerReferences set, so the issue we were trying to prevent actually happens 100% of the time when deploying KNative with ArgoCD.

Here are some related upstream issues:

What's the solution?

I propose we make two changes:

  1. Add a config which will disable the behavior of setting the ownerReferences:
    • For example, an environment variable like KNATIVE_DISABLE_WEBHOOK_OWNER which can be set on all controller pods.
  2. When we do set ownerReferences, we should NOT be setting controller=true:
    • This is much less likely to break downstream projects (obviously the Namespace is not the controlling resource). It also allows KNative to work with the upstream patch for ArgoCD that ignores non-controller ownerReferences.

Where is the relevant code?

The code which sets the ownerReferences lives in the knative-pkg libraries:

@dprotaso @skonto @creydr thanks for you work on KNative! I wanted to highlight this important issue as it causes KNative to break when deployed with ArgoCD because the webhooks are not removed on uninstall.

Hey @thesuperzapper I wanted to confirm something

Specifically, ArgoCD will never remove a resource that has ownerReferences set, so the issue we were trying to prevent actually happens 100% of the time when deploying KNative with ArgoCD.

Shouldn't ArgoCD delete the namespace (since it has no owner reference) and then this should propagate and then the webhooks should eventually be deleted?

Hey @thesuperzapper I wanted to confirm something

Specifically, ArgoCD will never remove a resource that has ownerReferences set, so the issue we were trying to prevent actually happens 100% of the time when deploying KNative with ArgoCD.

Shouldn't ArgoCD delete the namespace (since it has no owner reference) and then this should propagate and then the webhooks should eventually be deleted?

@dprotaso there are a few issues with that:

  1. Lots of Argo CD users provision namespaces in a separate application (so deleting the kNative app wouldn't delete it, hence the deadlock)
  2. Even if the application did contain the namespace, I think Argo CD would try and delete non-namespace resources first, still leading to a deadlock.

Unsure if pkg/3095 closes this issue - since there's no end-user knob to disable the owner references

Lots of Argo CD users provision namespaces in a separate application (so deleting the kNative app wouldn't delete it, hence the deadlock)

Interesting is this documented anywhere? or is there some guide that recommends this?

Even if the application did contain the namespace, I think Argo CD would try and delete non-namespace resources first, still leading to a deadlock.

So you said before argo wouldn't delete resources with owner references - so does it error out or skip the resource?

Lots of Argo CD users provision namespaces in a separate application (so deleting the kNative app wouldn't delete it, hence the deadlock)

Interesting is this documented anywhere? or is there some guide that recommends this?

This is a common pattern, especially in app-of-apps deployments.

Either way, ArgoCD is the most widest used gitOps tool in the world, so if we want people to use kNative, we need to be flexible with it.

Even if the application did contain the namespace, I think Argo CD would try and delete non-namespace resources first, still leading to a deadlock.

So you said before argo wouldn't delete resources with owner references - so does it error out or skip the resource?

@dprotaso think you might be slightly misunderstanding.

What is happening is that ArgoCD will never delete the ValidatingWebhookConfiguration / MutatingWebhookConfiguration resources (because they have the owner reference), however, it will delete the other resources, including the back-end services for the webhooks.

This means that because the webhooks have a failurePolicy: Fail, when that the back end is deleted, any operations which match the webhook selectors are now blocked (because the backend is down).

The easiest way to reproduce this is simply create an ArgoCD app for KNative that has the resources-finalizer.argocd.argoproj.io cascading deletion finalizer, sync it, then try delete it.

This cascading delete will deadlock 100% of the time in the way described.

@dprotaso I have made a follow-up PR in knative/pgk that adds an environment variable to disable the Namespace owner reference from being set in KNative Serving:

It adds an environment variable called WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP which sets the DisableNamespaceOwnership config that was added by @jonathan-innis in:

Next Steps

I think we should also backport it to 1.14 and 1.15 because otherwise its nearly impossible to deploy KNative Serving with ArgoCD. We might also want to add a basic reference to this in the docs (or hope people stumble across this issue).

That means we will need to cherry-pick the two PRs into the release-1.14 and release-1.15 of knative/pgk, as we can't just update the versions used to master.

We need to update the version of knative/pkg used in the following places:

Either way, ArgoCD is the most widest used gitOps tool in the world, so if we want people to use kNative, we need to be flexible with it.

I'm good with a toggle in knative.dev/pkg but I want to clarify my position that this is really seems like an ArgoCD bug. You can't expect all OSS projects to change their behaviour because of a quirk in a CD tool. I would continue pushing on that project to do the right thing.

eg. Long term it would be great if we could do the ownership references declaratively kubernetes/kubernetes#102810

I don't think this warrants a cherry pick and cutting new releases. Per our release schedule we'll have a new one out in a week. https://github.com/knative/community/blob/main/mechanics/RELEASE-SCHEDULE.md

I'm good with a toggle in knative.dev/pkg but I want to clarify my position that this is really seems like an ArgoCD bug. You can't expect all OSS projects to change their behaviour because of a quirk in a CD tool. I would continue pushing on that project to do the right thing.

@dprotaso I think that both KNative and ArgoCD are doing something unexpected here.

While I agree that ArgoCD should have a workaround for apps that misuse the OwnerReferences, I think that we (KNative) are actually misusing OwnerRefernces by setting the controller: true flag. That is, the Namespace is not the "controlling" resource for the webhook configuration, so it's confusing ArgoCD which assumes that people never want to delete a resource that is controlled by another.

ArgoCD does this because deleting an owned resource is usually dangerous. For example, when using cert-manager, ArgoCD should never delete a secret that is managed by a Certificate resource (which is indicated by the Certificate "owning" the Secret).

eg. Long term it would be great if we could do the ownership references declaratively kubernetes/kubernetes#102810

I doubt that upstream Kubernetes would accept this because the point of OwnerRefernces is for controllers, not end users.

I don't think this warrants a cherry pick and cutting new releases. Per our release schedule we'll have a new one out in a week. https://github.com/knative/community/blob/main/mechanics/RELEASE-SCHEDULE.md

I really think this is a bug (from the KNative perspective) and since we are still officially supporting 1.14 it warrants a patch release.

Thinking about this more I wonder if we really just want to set the controller to false but leave the owner reference.

Does that appease ArgoCD?

Looks like there's already a PR out for that - argoproj/gitops-engine#503

Thinking about this more I wonder if we really just want to set the controller to false but leave the owner reference.

Does that appease ArgoCD?

@dprotaso sadly no, but we should still do that change anyway, because it's more semantically correct. At some point ArgoCD might change to ignoring non-controller owners (but there is no consensus in the ArgoCD community yet on if this would introduce it's own problems).

So in the meantime, it would be great to get knative/pkg#3103 merged at at least made part of 1.15 so people can deploy kNative safely from ArgoCD.

@dprotaso there was a slight issue with my first PR, but I have made a quick followup knative/pkg#3107 which definitely works in my testing.

That is, when the WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP env-var is set to "true" on the "webhook" Deployments (running versions of Knative Serving with this patch), the ownerReferences are not set on the webhook configuration resources.

For example, you can apply the following Kustomize patch to do this:

patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: REPLACED_DURING_PATCH
      spec:
        template:
          spec:
            containers:
              - name: webhook
                env:
                  - name: WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP
                    value: "true"
    target:
      group: apps
      kind: Deployment
      name: ".*webhook.*"

@thesuperzapper should we close this is done?

@skonto it looks like this made its way into serving 1.16.0 and net-istio 1.16.0, so yes, we can probably close it.