kubernetes-retired/kube-aws

Token expired and kube-aws exited when updating all instances

PabloCastellano opened this issue · 10 comments

Hello.

I've hit an issue today while trying to update my cluster, which is composed of 90 instances and two nodepools. I wanted to change the instance family of both nodepools, which requires a slow replacement of all running instances.

Almost exactly one hour later, kube-aws exited with error code 2 and showed the following message:

Error: Error updating cluster: ExpiredToken: The security token included in the request is expired
	status code: 403, request id: 901b76e8-814b-11e9-82e1-f33a99ce5b0c

However, the update process did not immediately stop on the AWS side, so I thought everything was still going well, until suddenly:

The following resource(s) failed to update: [Workers]. 

Received 24 SUCCESS signal(s) out of 37. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

I'm not sure I understand the full picture, but I suspect that if kube-aws renewed the security token gracefully instead of crashing, this would not have happened.

FWIW, I'm using kube-aws v0.9.9, which I know is pretty old now (1.5 years at the time of writing), but I have dug into the code and haven't found any change in the master branch since then to handle expired tokens.

The following workaround worked for me:

  1. In kube-aws, add new nodepools
  2. In AWS, scale up the new autoscaling groups to the same size as the old ones
  3. In AWS, scale down the old autoscaling groups to 0 so that workloads migrate to the new ones (this might take a while; see the sketch after this list)
  4. In kube-aws, remove the old nodepools
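
Steps 2 and 3 can also be scripted rather than done in the console. Here's a minimal sketch using the AWS SDK for Go (v1), the same SDK kube-aws is built on; the ASG names and the target size are placeholders for illustration, not values from my cluster:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// setASGSize pins an autoscaling group's min, max and desired capacity to the same value.
func setASGSize(svc *autoscaling.AutoScaling, name string, size int64) error {
	_, err := svc.UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName: aws.String(name),
		MinSize:              aws.Int64(size),
		MaxSize:              aws.Int64(size),
		DesiredCapacity:      aws.Int64(size),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Step 2: scale the new ASG up to match the old one (placeholder name and size).
	if err := setASGSize(svc, "new-nodepool-asg", 37); err != nil {
		log.Fatal(err)
	}
	// Step 3: scale the old ASG down to 0 so that workloads migrate to the new nodes.
	if err := setASGSize(svc, "old-nodepool-asg", 0); err != nil {
		log.Fatal(err)
	}
}
```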

Hey @PabloCastellano,

You're right, this looks like it may still be an issue. Want to take a crack at fixing it?

@dominicgunn I'm happy to help but I need some guidance. Where in the code would you handle the token renewal?

Sorry for taking a while to get back to you @PabloCastellano,

I'd take a look at awsconn.go, and perhaps cluster.go.

awsconn.go is currently responsible for creating the session, so it may make sense to provide some functionality there to ensure it doesn't expire, keeping as much of the session code in one place as possible.
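
For reference, here's a minimal sketch of the idea, assuming awsconn.go keeps using aws-sdk-go v1: credentials obtained through a refreshing provider such as stscreds are renewed by the SDK automatically before they expire, whereas a fixed temporary token held for the whole update eventually returns the 403 above. The helper name, region and role ARN below are hypothetical, not existing kube-aws code:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newRefreshingSession is a hypothetical helper: instead of carrying one fixed
// set of temporary credentials for the life of the update, it assumes a role
// via STS and lets the SDK's credential provider re-assume (and so renew) the
// token transparently whenever it nears expiry.
func newRefreshingSession(region, roleARN string) (*session.Session, error) {
	// Base session uses the default credential chain (env vars, shared config,
	// instance profile, ...).
	base, err := session.NewSession(&aws.Config{Region: aws.String(region)})
	if err != nil {
		return nil, err
	}
	// stscreds.NewCredentials returns credentials that refresh automatically.
	creds := stscreds.NewCredentials(base, roleARN)
	return session.NewSession(&aws.Config{
		Region:      aws.String(region),
		Credentials: creds,
	})
}

func main() {
	sess, err := newRefreshingSession("eu-west-1", "arn:aws:iam::123456789012:role/kube-aws-admin")
	if err != nil {
		log.Fatal(err)
	}
	_ = sess // hand this session to the CloudFormation/EC2 clients as usual
}
```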

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.
