Image Tag Mutating Webhook Causes Existing Deployments to Enter a Creation Loop
SUSTAPLE117 opened this issue · 5 comments
Description
We run into an issue when we label the namespace our apps are deployed in (the apps reference their images by tag) so that our image signatures are validated.
We are experiencing a cascade of ReplicaSet and pod creation in our Kubernetes cluster, which we believe is triggered by the mutating webhook designed to convert image tags to digests while validating container image signatures.
The webhook seems to be causing an unexpected behavior: when new pods are created (e.g., by the HPA), their image tags are replaced with digests, leading to a mismatch between the actual pod specifications and the expected specifications in the Deployment's template.
Consequently, the Deployment controller tries to reconcile this perceived discrepancy by creating new ReplicaSets. However, something appears to go wrong in how the change is applied: we get repeated 'already exists' errors and a loop of ReplicaSet creation and deletion that overwhelms the cluster.
Example API logs we see:
Found a hash collision for deployment "findings-processor" - bumping collisionCount (4573->4574) to resolve it
"Error syncing deployment" deployment="default/findings-processor" err="replicasets.apps "findings-processor-888599648" already exists"
I1120 16:33:25.380980 12 deployment_controller.go:490] "Error syncing deployment" deployment="default/findings-processor" err="replicasets.apps "findings-processor-888599648" already exists"
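One plausible reading of these log lines, sketched below as a toy model rather than the Deployment controller's actual code: the controller derives the ReplicaSet name from a hash of the Deployment's pod template plus `collisionCount`. If the ReplicaSet stored under that name has a different template (here, because the webhook rewrote its image to a digest while the Deployment still holds the tag), the controller treats it as a hash collision, bumps the counter, and creates another ReplicaSet, which is mutated again. The names `template_hash`, `mutate`, `sync`, and `simulate` are hypothetical, SHA-256 stands in for the controller's real hash, and the all-zero digest is a placeholder.

```python
import hashlib
import json


def template_hash(template, collision_count):
    # Stand-in for the pod-template-hash; the real controller uses a different hash.
    payload = json.dumps(template, sort_keys=True) + "/" + str(collision_count)
    return hashlib.sha256(payload.encode()).hexdigest()[:10]


def mutate(template):
    # Hypothetical webhook: replace a tag reference with a (placeholder) digest.
    # Assumes the image reference carries an explicit tag.
    image = template["image"]
    if "@sha256:" in image:
        return dict(template)
    repo = image.rsplit(":", 1)[0]
    return {**template, "image": repo + "@sha256:" + "0" * 64}


def sync(template, replica_sets, collision_count, webhook_on_replicasets):
    # One simplified reconcile pass; returns the (possibly bumped) collision count.
    if any(stored == template for stored in replica_sets.values()):
        return collision_count  # a matching ReplicaSet exists, nothing to do
    name = "rs-" + template_hash(template, collision_count)
    if name in replica_sets:
        # Same name, different template: treated as a hash collision.
        return collision_count + 1
    replica_sets[name] = mutate(template) if webhook_on_replicasets else dict(template)
    return collision_count


def simulate(passes, webhook_on_replicasets, image="registry.example.com/findings-processor:v1"):
    template = {"image": image}
    replica_sets, collision_count = {}, 0
    for _ in range(passes):
        collision_count = sync(template, replica_sets, collision_count, webhook_on_replicasets)
    return len(replica_sets), collision_count
```

In this model, `simulate(6, True)` ends with three ReplicaSets and a collision count of 3, and both keep growing with more passes, while `simulate(6, False)` settles at one ReplicaSet and a count of 0. A Deployment whose template already holds a digest also settles, because the mutation leaves the ReplicaSet identical to the template.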
Here are the policy-controller logs: policy-controller_logs.json
Version
Kubernetes: 1.24 on EKS
policy-controller: 0.8.2
Looking at the logs, it seems the Deployments are either not getting patched (the webhook gets called, but there are no patchbytes) or have already been patched (so there's no need to patch again). ReplicaSets and Pods are getting patched, though. I'm confused about why the Deployment does not get patched when it is created (or why the patch does not show up in the logs). If the Deployment gets patched correctly, then the ReplicaSet (and therefore the Pods) should get the digest and not the tag. Can you verify that the Deployment has been correctly patched?
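One way to run that check, as a minimal sketch: inspect the Deployment's pod template and list any container image that is not pinned by digest. The helper name `unpinned_images` is hypothetical, and it assumes the manifest was loaded as a dict from the output of `kubectl get deployment findings-processor -o json`.

```python
def unpinned_images(deployment):
    # Return image references in the pod template that lack an @sha256: digest.
    pod_spec = deployment["spec"]["template"]["spec"]
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    return [c["image"] for c in containers if "@sha256:" not in c["image"]]


# Illustrative manifest fragment; the image name is a placeholder.
example = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "app", "image": "registry.example.com/findings-processor:v1"}
                ]
            }
        }
    }
}
print(unpinned_images(example))
```

An empty list would suggest the Deployment's template already carries digests; any entry in the list is an image still referenced by tag.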
@vaikas These are existing deployments in the namespace that we label with policy.sigstore.dev/include=true. The cascading creations happen almost immediately because deployments are constantly scaling up and down. Is there any guidance on how to roll this out to existing deployments? Would we have to manually patch all deployments to trigger the webhook?
@SUSTAPLE117 I'd recommend deploying the policy-controller and then labeling the target namespaces before deploying any resources to them. By the way, there is a feature to match specific resource types or resources with labels: https://docs.sigstore.dev/policy-controller/overview/#policies-matching-specific-resource-types-and-labels
@hectorj2f that makes a lot of sense. Thank you!
@hectorj2f Hi, I came back to this and used the match feature to target only Deployments, as follows:
spec:
  authorities:
  ...
  match:
  - group: apps
    resource: deployments
    version: v1
  mode: warn
However, the MutatingWebhook doesn't respect the match and still patches ReplicaSets and Pods to translate the image tag to a digest. Is that the expected behavior? If so, it's not easy to do a gradual rollout of policy-controller in an existing environment, unless I'm missing something.