fluxcd/flux

EKS v1.22 upgrade triggers Operational Notification from AWS regarding BoundServiceAccountToken

mariusmitrofan opened this issue · 13 comments

Describe the bug

After upgrading to AWS EKS v1.22 we've received the following operational notification from AWS.

They're basically saying that the token needs to be refreshed when using service accounts with the Kubernetes SDK, and that the methods for refreshing the BoundServiceAccountToken have already been provided in the respective SDKs.

The new SDKs that should be used are:

  • Go v0.15.7 and later
  • Python v12.0.0 and later
  • Java v9.0.0 and later
  • Javascript v0.10.3 and later
  • Ruby master branch
  • Haskell v0.3.0.0
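For anyone wondering what "refreshing the token" amounts to on the client side, here is a minimal sketch in Go. It is not client-go's actual implementation: `tokenSource`, `bearerTransport`, `readTokenFile` and the one-minute `defaultMaxAge` are hypothetical names and values chosen for illustration. The only assumption about the cluster is that the token is mounted at the default projected path.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
	"sync"
	"time"
)

// tokenPath is the default service account token location inside a pod.
const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// defaultMaxAge is an assumed re-read interval, well below the one-hour expiry.
const defaultMaxAge = time.Minute

// tokenSource caches a token and re-reads it once the cached copy is
// older than maxAge, instead of reading it once at startup.
type tokenSource struct {
	read   func() (string, error)
	maxAge time.Duration

	mu     sync.Mutex
	token  string
	readAt time.Time
}

func newTokenSource(read func() (string, error), maxAge time.Duration) *tokenSource {
	return &tokenSource{read: read, maxAge: maxAge}
}

// Token returns the cached token, refreshing it first if it is stale.
func (s *tokenSource) Token() (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.token != "" && time.Since(s.readAt) < s.maxAge {
		return s.token, nil
	}
	tok, err := s.read()
	if err != nil {
		return "", err
	}
	s.token, s.readAt = tok, time.Now()
	return s.token, nil
}

// bearerTransport attaches the current token to every outgoing request.
type bearerTransport struct {
	source *tokenSource
	base   http.RoundTripper
}

func (t *bearerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	tok, err := t.source.Token()
	if err != nil {
		return nil, err
	}
	clone := req.Clone(req.Context()) // a RoundTripper must not modify the original request
	clone.Header.Set("Authorization", "Bearer "+tok)
	return t.base.RoundTrip(clone)
}

// readTokenFile reads the mounted token from disk.
func readTokenFile() (string, error) {
	b, err := os.ReadFile(tokenPath)
	return strings.TrimSpace(string(b)), err
}

func main() {
	src := newTokenSource(readTokenFile, defaultMaxAge)
	client := &http.Client{Transport: &bearerTransport{source: src, base: http.DefaultTransport}}
	fmt.Printf("client ready, transport type %T\n", client.Transport)
	if _, err := src.Token(); err != nil {
		fmt.Println("could not read token (expected outside a pod):", err)
	}
}
```

The point of the sketch is only that the token is looked up per request through a cache with a bounded age, so a rotated file on disk is eventually picked up without restarting the process.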

I understand that v1 is in maintenance mode, but since this simply requires an SDK upgrade, I was hoping you could help.

See below full message from AWS:

Event type code: AWS_EKS_OPERATIONAL_NOTIFICATION
We have identified applications running in one or more of your Amazon EKS clusters that are not refreshing service account tokens. Applications making requests to Kubernetes API server with expired tokens will fail. You can resolve the issue by updating your application and its dependencies to use newer versions of Kubernetes client SDK that automatically refreshes the tokens.

What is the problem?
Kubernetes version 1.21 graduated BoundServiceAccountTokenVolume feature [1] to beta and enabled it by default. This feature improves security of service account tokens by requiring a one hour expiry time, over the previous default of no expiration. This means that applications that do not refetch service account tokens periodically will receive an HTTP 401 unauthorized error response on requests to Kubernetes API server with expired tokens. You can learn more about the BoundServiceAccountToken feature in EKS Kubernetes 1.21 release notes [2].

To enable a smooth migration of applications to the newer time-bound service account tokens, EKS v1.21+ extends the lifetime of service account tokens to 90 days. Applications on EKS v1.21+ clusters that make API server requests with tokens that are older than 90 days will receive an HTTP 401 unauthorized error response.

How can you resolve the issue?
To make the transition to time bound service account tokens easier, Kubernetes has updated the below official versions of client SDKs to automatically refetch tokens before the one hour expiration:
* Go v0.15.7 and later
* Python v12.0.0 and later
* Java v9.0.0 and later
* Javascript v0.10.3 and later
* Ruby master branch
* Haskell v0.3.0.0

We recommend that you update your application and its dependencies to use one of the above client SDK versions if you are on an older version.
While not an exhaustive list, the below AWS components have been updated to use the newer Kubernetes client SDKs that automatically refetch the token:
* Amazon VPC CNI: v1.8.0 and later
* CoreDNS: v1.8.4 and later
* AWS Load Balancer Controller: v2.0.0 and later
* kube-proxy: v1.21.2-eksbuild.2 and later

[yada-yada-yada]

We recommend that you update your applications and its dependencies that are using stale service accounts tokens to use one of the newer Kubernetes Client SDKs that refetches tokens.
If the service account token used is close to expiry (<90 days) and you do not have sufficient time to update your client SDK versions before expiry, then you can terminate existing pods and create new ones. This results in refetching of the service account token, giving you additional time (90 days) to update your client SDKs.

Affected resources:
[OBFUSCATED_CLUSTER_ARN]|flux:flux

Steps to reproduce

Install flux in v1.22 EKS cluster

Expected behavior

Notification should not be pushed from AWS

Kubernetes version / Distro / Cloud provider

Amazon EKS v1.22

Flux version

Flux v1.25.0 / Helm chart v1.12.0

Git provider

No response

Container Registry provider

No response

Additional context

No response

Maintenance Acknowledgement

  • I am aware of Flux v1's maintenance status

Code of Conduct

  • I agree to follow this project's Code of Conduct

Got same email from AWS. Both flux and helm-operator SA are flagged as using stale tokens.

Thanks for reporting this. We are having some internal discussion about it. The issue is acknowledged.

FWIW, the client SDK is already at version 0.21.x:

flux/go.mod

Lines 23 to 30 in 9615263

// Pin kubernetes dependencies to 1.21.3
replace (
	k8s.io/api => k8s.io/api v0.21.3
	k8s.io/apiextensions-apiserver => k8s.io/apiextensions-apiserver v0.21.3
	k8s.io/apimachinery => k8s.io/apimachinery v0.21.3
	k8s.io/client-go => k8s.io/client-go v0.21.3
	k8s.io/code-generator => k8s.io/code-generator v0.21.3
)
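To make the version comparison explicit, here is a small sketch that checks a pinned version against the minimum from the AWS notice. `parseVersion` and `atLeast` are hypothetical helpers written for this comment, and they only handle plain `vMAJOR.MINOR.PATCH` strings (no pre-release or build suffixes).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a "vMAJOR.MINOR.PATCH" string into its three numbers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("version %q is not vMAJOR.MINOR.PATCH", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether version v is greater than or equal to minimum.
func atLeast(v, minimum string) (bool, error) {
	a, err := parseVersion(v)
	if err != nil {
		return false, err
	}
	b, err := parseVersion(minimum)
	if err != nil {
		return false, err
	}
	for i := range a {
		if a[i] != b[i] {
			return a[i] > b[i], nil
		}
	}
	return true, nil
}

func main() {
	// Pinned client-go version from go.mod vs. the minimum in the AWS notice.
	ok, err := atLeast("v0.21.3", "v0.15.7")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("client-go meets the stated minimum:", ok)
}
```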

The related issue here from Flux v2:

The fix there looks quite extensive, involving more than just a dependency upgrade, which conflicts with this statement from the AWS notice:

To make the transition to time bound service account tokens easier, Kubernetes has updated the below official versions of client SDKs to automatically refetch tokens before the one hour expiration:

  • Go v0.15.7 and later

It's not clear to me what needs to be upgraded, but I'd like to note that this issue is being addressed in the Flux v2 repos and a fix has already been merged. So, it will be solved in the next release of Flux v2.

As with any issue that gets escalated here, the first thing we recommend is planning your upgrade to Flux v2.

The internal discussion is ongoing about how to address this and how to best honor the migration timetable. Please bear with us!

After some research, there seems to be an API server flag, --service-account-extend-token-expiration, which extends the token lifetime to 1 year even if the token is reported as stale. This is enabled by default in vanilla k8s. It seems that AWS just shortens that lifetime to 90 days.
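One way to see what lifetime a mounted token actually carries is to decode its claims. The sketch below assumes the token is a JWT with the standard `iat` and `exp` claims and that it sits at the default projected path; it does not verify the signature, and whether `exp` reflects the extended lifetime on a given cluster is something to check rather than take from this snippet. `tokenClaims`, `parseClaims` and `lifetime` are hypothetical names.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"strings"
	"time"
)

// tokenPath is the default service account token location inside a pod.
const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// tokenClaims holds the two standard JWT time claims, in Unix seconds.
// Tokens without these claims decode to zero values.
type tokenClaims struct {
	IssuedAt  int64 `json:"iat"`
	ExpiresAt int64 `json:"exp"`
}

// parseClaims decodes the payload segment of a JWT without verifying it.
func parseClaims(token string) (tokenClaims, error) {
	var c tokenClaims
	parts := strings.Split(strings.TrimSpace(token), ".")
	if len(parts) != 3 {
		return c, errors.New("token is not a three-part JWT")
	}
	// JWT segments are base64url without padding; trim "=" defensively.
	payload, err := base64.RawURLEncoding.DecodeString(strings.TrimRight(parts[1], "="))
	if err != nil {
		return c, fmt.Errorf("decode payload: %w", err)
	}
	if err := json.Unmarshal(payload, &c); err != nil {
		return c, fmt.Errorf("parse payload: %w", err)
	}
	return c, nil
}

// lifetime is the span between the issue time and the expiry time.
func lifetime(c tokenClaims) time.Duration {
	return time.Duration(c.ExpiresAt-c.IssuedAt) * time.Second
}

func main() {
	raw, err := os.ReadFile(tokenPath)
	if err != nil {
		fmt.Println("could not read token (expected outside a pod):", err)
		return
	}
	c, err := parseClaims(string(raw))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("issued:  ", time.Unix(c.IssuedAt, 0).UTC())
	fmt.Println("expires: ", time.Unix(c.ExpiresAt, 0).UTC())
	fmt.Println("lifetime:", lifetime(c))
}
```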

Another workaround is to restart the Pod so that a new token gets mounted.

Received the same AWS email

Does Flux 1.24.* or 1.25.* fix this issue?

No, this issue is still open for now @yiyan-wish

Hi,

Is there a timeline for this fix?

pjbgf commented

We are aiming to have a release done between the end of this week and beginning of next.

Just a note, this isn't covered in the v1.25.2 release that's being pushed out right now.

I'm marking it for v1.25.3 milestone which could come out next week or later. We have not isolated a PR to fix this issue yet.

I don't want to ask this the wrong way, but is anyone monitoring/experiencing this issue? We need to know if there has been any workaround from either end, or whether it still affects some users to decide how much priority is needed for the fix.

My understanding is that one remediation is to monitor the logs for the error condition and/or restart the Flux pod once every 30-60 days, to avoid the failure that will eventually manifest after 90 days.
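The arithmetic behind that remediation can be sketched as follows. The 90-day lifetime comes from the AWS notice; the 45-day `restartInterval` is just an assumed value inside the 30-60 day window, and the sketch assumes the application only holds the token it read at pod start. `parseDay`, `restartDue` and `tokenExpiry` are hypothetical helpers, and the start date in `main` is made up.

```go
package main

import (
	"fmt"
	"time"
)

// tokenLifetime is the extended token lifetime on EKS, per the AWS notice.
const tokenLifetime = 90 * 24 * time.Hour

// restartInterval is an assumed value within the suggested 30-60 day window.
const restartInterval = 45 * 24 * time.Hour

// parseDay parses a YYYY-MM-DD date as midnight UTC.
func parseDay(s string) (time.Time, error) {
	return time.Parse("2006-01-02", s)
}

// restartDue reports whether a pod started at `started` has been running
// for at least restartInterval as of `now`.
func restartDue(started, now time.Time) bool {
	return now.Sub(started) >= restartInterval
}

// tokenExpiry returns when a token obtained at pod start stops being accepted.
func tokenExpiry(started time.Time) time.Time {
	return started.Add(tokenLifetime)
}

func main() {
	started, err := parseDay("2022-01-01") // hypothetical pod start date
	if err != nil {
		fmt.Println(err)
		return
	}
	now := started.AddDate(0, 0, 50) // pretend 50 days have passed
	fmt.Println("restart due:  ", restartDue(started, now))
	fmt.Println("token expires:", tokenExpiry(started).Format("2006-01-02"))
}
```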

The condition is also reported on the AWS side as an operational notification. I have a hard time prioritizing this because, as of today, users who still run Flux v1 should already have an operational notification on their dashboard that it is time to upgrade. It sounds like this issue does not affect anyone until 90 days have passed; then, if their Flux is running unmonitored and is not restarted, it will stop functioning with respect to any service that is accessed through the BoundServiceAccountToken (I guess that is the ECR repository for image update automation). So again, to clarify, it is definitely a bug if Flux behaves this way.

I would love to have that issue fixed for Flux v1.25.3 but I do not have a test environment on AWS so have not been able to reproduce it myself as of yet. If it can be reproduced then we can say for sure how to upgrade and then hopefully get it resolved quickly.

Still, I want to emphasize that at this point we caution any new users against choosing Flux v1, and we aim to get everyone safely to shore so that Flux v1 support can one day be closed. The migration timetable should clarify that Flux v2 has been the recommended choice for well over a year: https://fluxcd.io/docs/migration/timetable/

The question I have at this point: we have updated all dependencies that I am aware of, and the Kubernetes Go client is at v0.21.3, which is definitely > v0.15.7, so I'm not sure what else needs to be upgraded.

Has anyone experiencing this issue tried an upgrade to Flux v2 and found the issue is or isn't present there?

As far as what we've tried and what else remains on the table: we cannot upgrade to Kubernetes client v0.22.0+ without a breaking change, as that client version drops many beta APIs from older Kubernetes versions, and while Flux v1 may ultimately do this upgrade while it remains in maintenance, it will be a breaking change for anyone who upgrades Flux but fails to upgrade their Kubernetes cluster.

It is hard to imagine this person who wants the latest Flux image but refuses to upgrade their Kubernetes version after the release they are on is past EOL, so I think it makes sense to eventually publish a release that does have this upgrade, maybe when Kubernetes v1.26 or v1.27 is released and all Kubernetes releases that carry any v1beta APIs are long since past EOL.

But we have yet to have any discussion about when this breaking upgrade should take place in Flux v1, or even whether it should happen at all to prolong the support (rather than marking this repo as read-only and hoping that users follow us to Flux v2).

Still the link you quoted only says an upgrade to v0.15.7 was needed, so I'm really not sure what is left to be upgraded. We're well past that mark and I'm not sure what else to try (in the context of no repro environment, this should be clear!)

If a fix is possible for this issue, we'll be glad to merge it in for 1.25.3; I've scheduled the release for about 2 weeks out now. We can push it out sooner if the fix is known, so please submit it as a PR and we will review it ASAP. Otherwise, I just want to make sure that a release goes out periodically, as a courtesy to users, so that we can see a clean result on the security scans for any CVE dependency alerts that inevitably crop up from time to time.

I am no longer able to spend as much time on Flux v1 maintenance myself, but @pjbgf has been taking up the mantle and is in the loop for releases now, while I'll be continuing to monitor issues as they are reported here. 👍

pjbgf commented

We have recently upgraded all dependencies again (including the AWS SDK) and shall release a new version in the coming days. That is mostly to mitigate security vulnerabilities, and we have no reason to believe it will fix the issue, given that the requirements in terms of SDK/dependency versions were met a long time ago. Either way, I would recommend that users give it a try.

A workaround for this issue is to force the pod to be restarted before the token expiration period (in EKS' case that is 90 days). That could be achieved with a Kubernetes CronJob or by simply redeploying Flux within that time frame.
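As a rough illustration of what such a scheduled restart could send, the sketch below builds the merge-patch body that bumps the `kubectl.kubernetes.io/restartedAt` pod-template annotation, which is the annotation `kubectl rollout restart` sets. `restartPatch` is a hypothetical helper; how the patch gets applied (which Deployment, which namespace, which credentials) is left out and depends on the installation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// restartAnnotation is the pod-template annotation set by `kubectl rollout restart`.
const restartAnnotation = "kubectl.kubernetes.io/restartedAt"

// restartPatch builds a merge-patch body that changes the pod template,
// which makes the Deployment controller roll out new pods.
func restartPatch(stamp string) ([]byte, error) {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"template": map[string]interface{}{
				"metadata": map[string]interface{}{
					"annotations": map[string]string{restartAnnotation: stamp},
				},
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := restartPatch(time.Now().UTC().Format(time.RFC3339))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(body))
}
```

Assuming Flux runs as a Deployment named `flux` in the `flux` namespace (adjust to your install), the printed body could be applied with `kubectl patch deployment flux -n flux --type merge -p "$BODY"`, or the same effect obtained directly with `kubectl rollout restart deployment/flux -n flux` from the CronJob.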

Unfortunately, we won't be able to fix the root cause of this issue, as Flux v1 is in migration and security support only. We recommend that users migrate to Flux 2 at their earliest convenience so they don't encounter other issues, as newer Kubernetes versions may degrade Flux v1 operations.

More information about the Flux 2 transition timetable can be found at: https://fluxcd.io/docs/migration/timetable/.