aws/karpenter-provider-aws

Upgrading from 0.37.0 to 0.37.4 produces service account issues

Description

When we tested upgrading our Argo-managed install of Karpenter from 0.37.0 to 0.37.4, the controller logged errors about missing role permissions for the service account.
Example error:
k8s.io/client-go@v0.30.1/tools/cache/reflector.go:232: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User \"system:serviceaccount:kube-system:karpenter\" cannot list resource \"configmaps\" in API group \"\" in the namespace \"kube-system\".

Webhooks are disabled in the chart configuration.

Not sure if it's related, but when I tried to manage the Karpenter chart install with Kustomize, it produced another set of errors about a missing role and rolebindings for the kube-node-lease namespace.

I can confirm the issue; I hit it going from 0.37.2 to 0.37.5.

It looks like the ConfigMap permissions are linked to the webhooks being enabled.

{{- if .Values.webhook.enabled }}
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list", "watch"]
{{- end }}

But the container needs them at startup regardless of the webhook status....

These RBAC permissions should only be required when the Webhook is enabled, and have not changed between v0.37.0 and v0.37.5. Did you previously have Karpenter installed with webhooks enabled? There was a known issue with the interaction between ArgoCD and knative (what Karpenter uses for webhooks) that caused Argo to fail to prune MutatingWebhookConfiguration and ValidatingWebhookConfiguration resources. This should have been addressed with v0.37.5. Are you both able to confirm if those resources are present in your cluster? Also, @adrianmiron are you able to confirm that the image for your Karpenter deployment is v0.37.5 (kubectl get deployments -n kube-system karpenter -ojsonpath='{.spec.template.spec.containers[*].image}')?

I confirm the webhook resource cleanup is now fixed, but yes, with the 0.37.5 chart version and webhooks off I got the error on pod startup (I changed nothing else besides the Karpenter chart version in my wrapper chart, from 0.37.2, which is what most of my clusters are on).

I can double confirm with a printscreen tomorrow if that would help push things along.

Let's double confirm the environment variables and the image version on your Karpenter deployment. Outside of webhooks, there shouldn't be anything watching configmaps or secrets. With the webhooks disabled, Karpenter should never make these list calls in the first place.

Never mind, I've been able to reproduce the issue and root-cause it. In Karpenter v0.37.3, webhooks were enabled by default. This consisted of two changes: a change to the chart in the AWS provider (#6900) and a change in the upstream repo to the CLI / environment variable parser (kubernetes-sigs/karpenter#1616). Each of these changes is fine individually, but combined they make it impossible to actually disable the webhooks. This is because the Helm chart only sets the DISABLE_WEBHOOK env var when webhook.enabled is set to true, so with webhook.enabled set to false the variable is left unset and the new upstream default applies.

{{- if .Values.webhook.enabled }}
- name: WEBHOOK_PORT
  value: "{{ .Values.webhook.port }}"
- name: WEBHOOK_METRICS_PORT
  value: "{{ .Values.webhook.metrics.port }}"
- name: DISABLE_WEBHOOK
  value: "false"
{{- end }}

The Helm chart needs to be updated to set DISABLE_WEBHOOK to true when webhooks are disabled, due to the upstream change to the default. As for workarounds until we get a patch release out, you can update the Helm chart yourself to set the environment variable appropriately, or roll back to v0.37.2.
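
For anyone patching the chart locally, here is a minimal sketch of what that change could look like. It assumes the env block quoted above is the only place the chart sets DISABLE_WEBHOOK; the else branch is an illustration of the idea, not the actual patch, and the indentation would need to match the surrounding env list in the real deployment template.

```yaml
{{- if .Values.webhook.enabled }}
- name: WEBHOOK_PORT
  value: "{{ .Values.webhook.port }}"
- name: WEBHOOK_METRICS_PORT
  value: "{{ .Values.webhook.metrics.port }}"
- name: DISABLE_WEBHOOK
  value: "false"
{{- else }}
# Assumed addition: set the variable explicitly so the upstream
# default (webhooks enabled) is not picked up when webhooks are off.
- name: DISABLE_WEBHOOK
  value: "true"
{{- end }}
```

With this in place, the variable is rendered on both branches, so the controller no longer depends on the upstream default.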

I just enabled the webhooks, upgraded to 0.37.5, did the 1.0.X upgrade, and then disabled them afterwards, in case anyone wants to use the existing versions and not wait for another release.

Hi @adrianmiron, thanks for the update. Before your upgrade was there a particular reason that prevented the update to v1 from these latest patches?

@rschalo Not sure I understand your question, but if you mean what issues I had going from 0.37.2 to 1.0.X (which is when I last tried to do this), there were many and they are not really relevant any more (the biggest thing was the IAM policy change and the fact that some of the permissions were not assigned to the controller because of tags, etc.).

If you mean something else entirely, please elaborate :)