Upgrading from 0.37.0 to 0.37.4 produces service account issues
Opened this issue · 8 comments
Description
When we tested upgrading our Argo-managed install of Karpenter from 0.37.0 to 0.37.4, the controller logged errors about missing role permissions for its service account.
Example error:
k8s.io/client-go@v0.30.1/tools/cache/reflector.go:232: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User \"system:serviceaccount:kube-system:karpenter\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\"
Webhooks are disabled in the chart configuration.
Not sure if it's related, but when I tried to manage the Karpenter chart install with Kustomize, it produced another set of errors about missing roles and rolebindings for the kube-node-lease namespace.
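One way to check the permission named in the error directly is `kubectl auth can-i` with impersonation. This is a minimal sketch: the helper name `karpenter_can_i` is made up for illustration, the service account and namespace are taken from the error message above, and your own user needs impersonation rights for `--as` to work.

```shell
# Ask the API server whether the Karpenter service account may perform
# a verb on a resource in kube-system. The account and namespace come
# from the error message; adjust them if your install differs.
karpenter_can_i() {
  local verb="$1" resource="$2"
  kubectl auth can-i "$verb" "$resource" \
    --as=system:serviceaccount:kube-system:karpenter \
    --namespace kube-system
}

# Example call (prints "yes" or "no"):
#   karpenter_can_i list configmaps
```

A "no" here matches the forbidden error: the Role granting configmaps access was not rendered.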
I can confirm the issue; I tried going from 0.37.2 to 0.37.5.
It looks like the ConfigMap permissions are linked to the webhooks being enabled.
{{- if .Values.webhook.enabled }}
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
{{- end }}
But the container needs them at startup regardless of the webhook status....
These RBAC permissions should only be required when the webhook is enabled, and have not changed between v0.37.0 and v0.37.5. Did you previously have Karpenter installed with webhooks enabled? There was a known issue with the interaction between ArgoCD and knative (what Karpenter uses for webhooks) that caused Argo to fail to prune MutatingWebhookConfiguration and ValidatingWebhookConfiguration resources. This should have been addressed with v0.37.5. Are you both able to confirm if those resources are present in your cluster? Also, @adrianmiron are you able to confirm that the image for your Karpenter deployment is v0.37.5 (kubectl get deployments -n kube-system karpenter -ojsonpath='{.spec.template.spec.containers[*].image}')?
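To check whether any of those webhook resources were left behind, something like the following could be used. This is a sketch only: the helper name `list_karpenter_webhooks` is made up, and filtering on the string "karpenter" is an assumption about how the resources are named, so inspect the full list if in doubt.

```shell
# List webhook configurations whose name mentions Karpenter.
# The "karpenter" filter is an assumption about naming, not a guarantee.
list_karpenter_webhooks() {
  local all
  # Fail early if the cluster query itself fails.
  all="$(kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o name)" || return 1
  printf '%s\n' "$all" | grep -i karpenter \
    || echo "no Karpenter webhook configurations found"
}

# Example call:
#   list_karpenter_webhooks
```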
I can confirm the webhook resource cleanup is now fixed, but yes, with the 0.37.5 chart version and webhooks off I got the error on pod startup (I changed nothing besides the Karpenter chart version in my wrapper chart, up from 0.37.2, which most of my clusters are on).
I can double confirm with a screenshot tomorrow if that would help push things along.
Let's double confirm the environment variables and the image version on your Karpenter deployment. Outside of webhooks, there shouldn't be anything watching configmaps or secrets. With the webhooks disabled, Karpenter should never make these list calls in the first place.
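For the environment variable check, a JSONPath query against the deployment prints each variable as `NAME=value`. This is a sketch: the helper name `karpenter_env` is made up, the deployment name and namespace are the ones used earlier in this thread, and variables populated through `valueFrom` will show an empty value.

```shell
# Print the env vars of every container in the Karpenter deployment,
# one NAME=value pair per line. Entries that use valueFrom print as NAME=.
karpenter_env() {
  kubectl get deployment karpenter -n kube-system \
    -o jsonpath='{range .spec.template.spec.containers[*].env[*]}{.name}={.value}{"\n"}{end}'
}

# Example call:
#   karpenter_env | grep DISABLE_WEBHOOK
```

If the `grep` prints nothing, the variable is not set on the deployment at all.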
Never mind, I've been able to reproduce the issue and find the root cause. In Karpenter v0.37.3 webhooks were enabled by default. This consisted of two changes: a change to the chart in the AWS provider (#6900), and a change in the upstream repo to the CLI / environment variable parser (kubernetes-sigs/karpenter#1616). Each of these changes is fine individually, but combined they make it impossible to actually disable the webhooks. This is because the Helm chart only sets the DISABLE_WEBHOOK env var when webhook.enabled is set to true.
karpenter-provider-aws/charts/karpenter/templates/deployment.yaml
Lines 79 to 86 in f587167
The Helm chart needs to be updated to set DISABLE_WEBHOOK to true when webhooks are disabled, due to the upstream change to the default. As for workarounds until we get a patch release out, you can update the Helm chart yourself to set the environment variable appropriately or roll back to v0.37.2.
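As a stopgap that is different from editing the chart, the variable could also be set directly on the running deployment with `kubectl set env`. This is only a sketch: the helper name `force_disable_webhook` is made up, and with an Argo-managed install a manual change like this may be reverted on the next sync, so the chart-level fix or the rollback remains the durable option.

```shell
# Set DISABLE_WEBHOOK=true on the live Karpenter deployment.
# This patches the cluster object, not the chart, so GitOps tooling
# may overwrite it on the next sync.
force_disable_webhook() {
  kubectl set env deployment/karpenter --namespace kube-system DISABLE_WEBHOOK=true
}

# Example call:
#   force_disable_webhook
```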
I just enabled the webhooks, upgraded to 0.37.5, did the 1.0.X upgrade, and then disabled them afterwards, in case anyone wants to use the existing versions and not wait for another release.
Hi @adrianmiron, thanks for the update. Before your upgrade was there a particular reason that prevented the update to v1 from these latest patches?
@rschalo Not sure I understand your question, but if you mean what issues I had going from 0.37.2 to 1.0.X (which is when I last tried to do this), there were many and they are not really relevant any more (the biggest was the IAM policy change and the fact that some of the permissions were not assigned to the controller because of tags, etc.).
If you mean something else entirely, please elaborate :)