kubernetes-retired/kube-aws

Token expired and kube-aws exited when updating all instances

PabloCastellano opened this issue · 10 comments

Hello.

I've hit an issue today while trying to update my cluster, which is composed of 90 instances and two nodepools. I wanted to change the instance family of both nodepools, which requires a slow replacement of all running instances.

Almost exactly one hour later, kube-aws exited with error code 2 and showed the following message:

Error: Error updating cluster: ExpiredToken: The security token included in the request is expired
	status code: 403, request id: 901b76e8-814b-11e9-82e1-f33a99ce5b0c

However, the update process did not immediately stop on the AWS side, so I thought everything was still going well, until suddenly:

The following resource(s) failed to update: [Workers]. 

Received 24 SUCCESS signal(s) out of 37. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

I'm not sure I understand the full picture, but I suspect that if kube-aws renewed the security token gracefully instead of crashing, this would not have happened.

FWIW, I'm using kube-aws v0.9.9, which I know is pretty old now (1.5 years at the time of writing), but I have dug into the code and haven't found any change in the master branch since then to handle expired tokens.

The following workaround worked for me:

  1. In kube-aws, add new nodepools
  2. In AWS, scale up the new autoscaling groups to the same size as the old ones
  3. In AWS, scale down the old autoscaling groups to 0 so that workloads migrate to the new ones (this might take a while; see the sketch after this list)
  4. In kube-aws, remove the old nodepools
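
Steps 2 and 3 can also be scripted rather than done in the console. Here's a minimal sketch using the AWS SDK for Go (v1), the same SDK kube-aws is built on; the ASG names and the target size are placeholders for illustration, not values from my cluster:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// setASGSize pins an autoscaling group's min, max and desired capacity to the same value.
func setASGSize(svc *autoscaling.AutoScaling, name string, size int64) error {
	_, err := svc.UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName: aws.String(name),
		MinSize:              aws.Int64(size),
		MaxSize:              aws.Int64(size),
		DesiredCapacity:      aws.Int64(size),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Step 2: scale the new ASG up to match the old one (placeholder name and size).
	if err := setASGSize(svc, "new-nodepool-asg", 37); err != nil {
		log.Fatal(err)
	}
	// Step 3: scale the old ASG down to 0 so that workloads migrate to the new nodes.
	if err := setASGSize(svc, "old-nodepool-asg", 0); err != nil {
		log.Fatal(err)
	}
}
```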

Hey @PabloCastellano,

You're right, this looks like it may still be an issue. Want to take a crack at fixing it?

@dominicgunn I'm happy to help but I need some guidance. Where in the code would you handle the token renewal?

Sorry for taking a while to get back to you @PabloCastellano,

I'd take a look at awsconn.go, and perhaps cluster.go.

awsconn.go is currently responsible for creating the session, so it may make sense to provide some functionality there to ensure it doesn't expire, keeping as much of the session code in one place as possible.
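
For reference, here's a minimal sketch of the idea, assuming awsconn.go keeps using aws-sdk-go v1: credentials obtained through a refreshing provider such as stscreds are renewed by the SDK automatically before they expire, whereas a fixed temporary token held for the whole update eventually returns the 403 above. The helper name, region and role ARN below are hypothetical, not existing kube-aws code:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newRefreshingSession is a hypothetical helper: instead of carrying one fixed
// set of temporary credentials for the life of the update, it assumes a role
// via STS and lets the SDK's credential provider re-assume (and so renew) the
// token transparently whenever it nears expiry.
func newRefreshingSession(region, roleARN string) (*session.Session, error) {
	// Base session uses the default credential chain (env vars, shared config,
	// instance profile, ...).
	base, err := session.NewSession(&aws.Config{Region: aws.String(region)})
	if err != nil {
		return nil, err
	}
	// stscreds.NewCredentials returns credentials that refresh automatically.
	creds := stscreds.NewCredentials(base, roleARN)
	return session.NewSession(&aws.Config{
		Region:      aws.String(region),
		Credentials: creds,
	})
}

func main() {
	sess, err := newRefreshingSession("eu-west-1", "arn:aws:iam::123456789012:role/kube-aws-admin")
	if err != nil {
		log.Fatal(err)
	}
	_ = sess // hand this session to the CloudFormation/EC2 clients as usual
}
```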

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.
