[EKS] [request]: On create: only return ACTIVE when endpoint actually usable
dpiddockcmp opened this issue · 15 comments
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Tell us about your request
When an EKS cluster is created, the API reports an "ACTIVE" status before the endpoint can actually process requests. This means the first few attempts to use a new cluster receive connection timeouts. Every project that creates clusters has to implement retry logic for the first access to the API, usually when updating the aws-auth ConfigMap.
It would be super useful if the API reported an ACTIVE status on newly created clusters only once the endpoint was actually available to process requests. We've already waited over 10 minutes for the cluster to come up, so waiting ~30 seconds more for it to actually be usable wouldn't be a big issue.
Which service(s) is this request for?
EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
The Terraform EKS community module is trying to migrate from running kubectl in a shell to using the kubernetes provider for creating the aws-auth ConfigMap. This would help with cross-platform use. Unfortunately, because "ACTIVE" does not mean "USABLE", we've hit issues chaining the two providers together.
The kubernetes provider itself has refused to implement retry logic on connection timeouts.
Are you currently working around this issue?
Projects that create clusters have some form of retry loop with a sleep.
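For illustration, the retry loop usually looks something like the following sketch. The cluster name, attempt count, and sleep interval are placeholders, not any particular project's actual script:
```bash
#!/usr/bin/env bash
# Hypothetical retry-with-sleep loop around the first real request to a new
# cluster (applying the aws-auth ConfigMap). Cluster name, attempt count and
# interval are illustrative placeholders.
aws eks update-kubeconfig --name my-cluster

for attempt in $(seq 1 30); do
  if kubectl apply -f aws-auth.yaml; then
    echo "aws-auth applied on attempt $attempt"
    exit 0
  fi
  echo "Cluster is ACTIVE but the endpoint is not usable yet; retrying in 10s"
  sleep 10
done

echo "Gave up waiting for the API endpoint" >&2
exit 1
```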
Additional context
Will potentially make other requests that deal with newly created clusters easier: #185, #254, #51
> the API reports an "ACTIVE" status before the endpoint can actually process requests
It's super annoying in Terraform and also seems like quite a basic bug. Would love a real fix from AWS.
Thanks for the issue report. We are looking into this further
@mikestef9 Thanks for taking the time on this. This is indeed an annoying "bug" and we would like to avoid adding some kind of buggy retry or wait logic.
Couldn't this be quickly fixed by checking the Kubernetes /healthz or /readyz (or similar) endpoint before the EKS API returns an ACTIVE status?
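For reference, such a check could be as simple as the sketch below. Whether anonymous access to /healthz is allowed depends on the cluster's RBAC configuration, but any HTTP response at all already shows the endpoint is reachable:
```bash
# Illustrative checks against the apiserver health endpoints; ENDPOINT is the
# value of cluster.endpoint from `aws eks describe-cluster`.
curl -sk "$ENDPOINT/healthz"          # returns "ok" once the apiserver is serving
curl -sk "$ENDPOINT/readyz?verbose"   # per-check readiness detail on Kubernetes 1.16+
```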
FYI, I opened a PR in the AWS provider to wait for the Kubernetes endpoint: hashicorp/terraform-provider-aws#11426. I don't know if it's the right quick win before this issue gets solved. Feedback is welcome.
@mikestef9 Any updates on this?
Is this behavior consistently reproducible? Trying to figure out if this is due to race condition in some scenarios but not the others.
@jqmichael This happens every time I run my TF to spin up a new cluster. I have to run a second time to get the final step run.
We also ran into this issue. Our workaround is a few lines of shell script in a provisioner that periodically run curl to check the endpoint. It can take a minute or more after AWS reports the cluster as ready before it actually becomes usable.
This isn't the only AWS resource type where we've had to implement this kind of workaround.
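Roughly, the wait looks like this. The endpoint lookup, 5-minute deadline, and 10-second interval below are illustrative rather than our exact script:
```bash
#!/usr/bin/env bash
# Sketch of a curl-based wait suitable for running from a provisioner.
endpoint=$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.endpoint' --output text)

deadline=$((SECONDS + 300))
# Any HTTP response at all (even 401/403) means the NLB is forwarding to the apiserver.
until curl -sk --max-time 5 "$endpoint/healthz" >/dev/null; do
  if [ "$SECONDS" -ge "$deadline" ]; then
    echo "Timed out waiting for $endpoint to become reachable" >&2
    exit 1
  fi
  sleep 10
done
echo "$endpoint is reachable"
```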
We definitely need to reproduce this on the EKS side. But just curious, did the initial request ever get a TCP SYN/ACK back (trying to figure out whether the packet gets dropped in the middle or reaches the apiserver)?
@jqmichael Were you able to reproduce this? You can use @dpiddockcmp's gist to test this quickly: https://gist.github.com/dpiddockcmp/23342f3b601b3432b1ea98ab61af6ba0
We narrowed it down to the propagation delay in the Network Load Balancer (NLB) data plane after the AutoScalingGroup registers targets. The NLB team is launching a campaign to reduce the propagation delay later this year. Until that campaign is finished, the workaround is to retry on the client side until the traffic goes through.
This also affects terminating instances; API servers on terminating control plane instances still serve requests because of (what seems to be) a deregistration delay in the NLB.
Does this issue also happen when upgrading a cluster? In our case, we manage an EKS cluster with CloudFormation, and we sometimes find that communication with the API server is unstable immediately after upgrading, even though the stack status is UPDATE_COMPLETE.
any updates on this?
are there any workarounds for this until we find a fix?