giantswarm/aws-operator

Idle Timeout for Elastic Load Balancers should be configurable

hobti01 opened this issue · 8 comments

When using Helm/Tiller, deployments may take longer than the 60 second default idle timeout when communicating with the API server. While the immediate issue is with the API server ELB, it makes sense to allow configuring the idle timeout on all ELBs.
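For context, the idle timeout is a per-load-balancer attribute on classic ELBs. Below is a minimal sketch of the underlying AWS API call using aws-sdk-go; this is not the operator's actual code, and the load balancer name is a placeholder.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elb"
)

func main() {
	svc := elb.New(session.Must(session.NewSession()))

	// Raise the idle timeout on a classic ELB from the 60 second default.
	// "my-cluster-api" is a placeholder load balancer name.
	_, err := svc.ModifyLoadBalancerAttributes(&elb.ModifyLoadBalancerAttributesInput{
		LoadBalancerName: aws.String("my-cluster-api"),
		LoadBalancerAttributes: &elb.LoadBalancerAttributes{
			ConnectionSettings: &elb.ConnectionSettings{
				IdleTimeout: aws.Int64(300), // seconds
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```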

@hobti01 Thanks so much for raising this and the PRs!

Yes, I can see how the timeout could affect Helm deployments. I'll discuss this with the team and get back to you.

Ping @puja108 @fgimenez

r7vme commented

Looks like this also causes Calico (confd) issues. When the cluster was scaled, new workers could not properly join as BGP peers because confd missed etcd events and did not reconfigure bird. We don't see this issue in on-prem guest clusters, because the load balancer there either doesn't have this kind of timeout or has a really long one (e.g. multiple hours).

r7vme commented

etcd uses a different load balancer, not the one used for ingress, so I'll create a separate issue.

The aws-operator change is deployed. We still need to set the timeout values in the cluster custom object. I'd propose:

  • API - 300 seconds
  • Etcd - 3600 seconds
  • Ingress - 300 seconds

Setting etcd to 3600 secs will resolve #464 raised by Roman.
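For reference, these values would end up in the cluster custom object along the lines of the following sketch; the field names here are assumptions for illustration, not necessarily the actual schema.

```yaml
# Hypothetical excerpt of the cluster custom object.
# Field names are assumptions, not the actual aws-operator schema.
spec:
  aws:
    api:
      elb:
        idleTimeoutSeconds: 300
    etcd:
      elb:
        idleTimeoutSeconds: 3600
    ingress:
      elb:
        idleTimeoutSeconds: 300
```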

@r7vme @calvix Are you OK with these values?

@hobti01 Is 300 secs high enough for apiserver to resolve your Helm problems?

r7vme commented

Etcd at 3600 is OK, but I'm not sure about the others.

The AWS idle timeout is a last resort for dropping stuck connections (there are also kernel TCP settings and application-level logic). On one hand, a short idle timeout can protect us from some attacks. On the other hand, the API has a lot of functionality that uses long-lived connections (e.g. watches, logs, execs).

I've searched Google for best practices for the k8s API. The only thing I found is that DEIS recommends 1200 sec. So from my side I think it makes sense to start with 1200 sec for API and Ingress.
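As a concrete illustration, these are the kinds of long-lived API connections that a short idle timeout can drop mid-stream (the pod name is a placeholder):

```sh
# Each of these holds one connection open through the API ELB for its whole duration.
kubectl logs -f my-pod
kubectl exec -it my-pod -- /bin/sh
kubectl get pods --watch
```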

@r7vme Thanks, OK, let's go with 1200 for API and ingress. I'll update kubernetesd to set these values.

@rossf7 Deploying the Elastic Stack with master and data nodes exceeds 300 seconds. Right now we are using a Helm timeout of 600 and an API server timeout of 900, which is OK. We'd be happy with defaults of 1200.
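For reference, the Helm-side timeout is set per install/upgrade; a minimal example assuming the Helm 2 CLI, with placeholder release and chart names:

```sh
# Helm 2: --timeout is in seconds (the default is 300).
helm upgrade --install my-release stable/elasticsearch --timeout 600
```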

The current aws-operator seems to keep resetting the timeout to 60 seconds ;)

@hobti01 Yes, I'm afraid it will keep resetting to 60 secs because the idle timeouts are not set in the cluster custom object.

Once the kubernetesd change is made the timeouts will be set for new clusters. I'll check if we can set the timeouts for existing clusters.