fluxcd/helm-controller

Helm-controller pod is using stale tokens

albertschwarzkopf opened this issue · 17 comments

Hi,

the "Bound Service Account Token Volume" is graduated to stable and enabled by default in Kubernetes version 1.22.
I am using "helm-controller:v0.21.0" in AWS EKS 1.22 and I have checked, if it is using stale tokens (regarding https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html and https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshooting-boundservicetoken).

When the API server receives requests with tokens that are older than one hour, it adds the "annotations.authentication.k8s.io/stale-token" annotation to the audit log entry. In my case I can see the following annotation, e.g.:
"annotations":{"authentication.k8s.io/stale-token":"subject: system:serviceaccount:flux-system:helm-controller, seconds after warning threshold: 56187"}

Version:

helm-controller:v0.21.0

Cluster Details

AWS EKS 1.22

Steps to reproduce issue

  • Enable EKS Audit Logs
  • Query CW Insights (select the cluster log group; a CLI variant is sketched below the query):
fields @timestamp
| filter @message like /seconds after warning threshold/
| parse @message "subject: *, seconds after warning threshold:*\"" as subject, elapsedtime
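
If the CloudWatch console is not handy, the same query can be run from the AWS CLI as well. The log group name below assumes the default EKS naming (/aws/eks/<cluster-name>/cluster), the cluster name is a placeholder, and date -d is GNU date syntax:

$ aws logs start-query \
    --log-group-name /aws/eks/<cluster-name>/cluster \
    --start-time $(date -d '-1 hour' +%s) \
    --end-time $(date +%s) \
    --query-string 'fields @timestamp | filter @message like /seconds after warning threshold/'
# the command returns a queryId; fetch the results with:
$ aws logs get-query-results --query-id <queryId>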

@albertschwarzkopf can you please confirm this happens with kustomize-controller also?

@stefanprodan thanks for the fast reply!

No, helm-controller only.
kustomize-controller is running version v0.25.0.

Also no issue with notification-controller:v0.23.5 and source-controller:v0.24.4

Does kustomize-controller run on the same node as helm-controller? Can you please post the output of kubectl -n flux-system get pods -owide here?

No, they are running on different nodes at the moment (we have several nodes).

[screenshot]

I see that kustomize-controller was restarted recently; please wait one hour and report back if kustomize-controller runs into the same issue. I'm trying to figure out whether this is something specific to helm-controller or a general problem with Kubernetes client-go on EKS.

pjbgf commented

> I see that kustomize-controller was restarted recently; please wait one hour and report back if kustomize-controller runs into the same issue. I'm trying to figure out whether this is something specific to helm-controller or a general problem with Kubernetes client-go on EKS.

After 72 minutes no issue with kustomize-controller...

I've created an EKS cluster:

$ kubectl version
Server Version: v1.22.6-eks-14c7a48

I've waited one hour:

$ kubectl -n flux-system get po
NAME                                       READY   STATUS    RESTARTS   AGE
helm-controller-88f6889c6-pwf7f            1/1     Running   0          73m
kustomize-controller-784bd54978-bckm6      1/1     Running   0          73m
notification-controller-648bbb9db7-58c2d   1/1     Running   0          73m
source-controller-79f7866bc7-k25z5         1/1     Running   0          73m

And there is no stale-token annotation on the pod:

$ kubectl -n flux-system get po helm-controller-88f6889c6-pwf7f -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    container.seccomp.security.alpha.kubernetes.io/manager: runtime/default
    kubernetes.io/psp: eks.privileged
    prometheus.io/port: "8080"
    prometheus.io/scrape: "true"
  creationTimestamp: "2022-05-10T10:08:59Z"
  generateName: helm-controller-88f6889c6-
  labels:
    app: helm-controller
    pod-template-hash: 88f6889c6
  name: helm-controller-88f6889c6-pwf7f
  namespace: flux-system

Yes, I can confirm this. Maybe it is visible only in the Audit Logs:

[screenshot]

@albertschwarzkopf can you give the first mentioned image in #480 a try, and if that does not yield results, the second?

@hiddeco thanks! I have tried both images today. Only the image ghcr.io/hiddeco/helm-controller:head-412201a worked as expected: I cannot see the mentioned annotation in the audit logs even after 1 hour.
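
For anyone verifying a patched image the same way, a narrower variant of the query from the issue description, scoped to the helm-controller service account (the extra filter is an assumption based on the annotation format shown above), could look like:

fields @timestamp
| filter @message like /seconds after warning threshold/
| filter @message like /flux-system:helm-controller/
| parse @message "subject: *, seconds after warning threshold:*\"" as subject, elapsedtime
| sort @timestamp desc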

Thanks for confirming. I'll finalize the PR in that case, and make sure it is included in the next release.

Note: we even got an automated email about this from AWS!

As of April 20th 2022, we have identified the below service accounts attached to pods in one or more of your EKS clusters using stale (older than 1 hour) tokens. Service accounts are listed in the format <cluster-arn>|namespace:serviceaccount:

arn:aws:eks:eu-west-2:****:cluster/prod-****|kube-system:multus
arn:aws:eks:eu-west-2:****:cluster/prod-****|flux-system:helm-controller

This also totally explains fluxcd/flux2#2074 (and the correlation between multus + helm we saw).

Got the same message from AWS. Only the helm-controller SA was flagged. All controllers have been running for the same period of time.

NAME                                           READY   STATUS    RESTARTS      AGE
helm-controller-5676d55dff-7lgvn               1/1     Running   0             16d
image-automation-controller-6444ccb58c-8xcls   1/1     Running   0             16d
image-reflector-controller-f64677dd5-974qs     1/1     Running   0             16d
kustomize-controller-76f9d4f99f-htp8d          1/1     Running   0             16d
notification-controller-846fff6d67-h677q       1/1     Running   0             16d
source-controller-55d799ff7d-w598g             1/1     Running   0             16d
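
To cross-check which pods use the service accounts named in the AWS notification, a quick mapping such as the following should work (the column names are arbitrary):

$ kubectl -n flux-system get pods \
    -o custom-columns=POD:.metadata.name,SERVICEACCOUNT:.spec.serviceAccountName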

We got the notification message from AWS as well, but only for the helm-controller, even though all pods have been up and running for 85 days.

I can confirm the same problem here on EKS v1.22.6-eks-7d68063. Not sure if it's interesting or related, but after moving to EKS 1.22 the client authentication API changed from client.authentication.k8s.io/v1alpha1 to client.authentication.k8s.io/v1beta1.
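
If it helps anyone checking the same, the exec plugin API version used by the current kubeconfig context can be read with something like this (assuming a kubeconfig generated by aws eks update-kubeconfig):

$ kubectl config view --minify -o jsonpath='{.users[0].user.exec.apiVersion}'
# prints e.g. client.authentication.k8s.io/v1beta1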

As already mentioned in #479 (comment), we have identified the issue and staged a patch; this will be solved in the next release.