kube-aws-0.14.2 Unable to create stack with cluster-autoscaler enabled
flah00 opened this issue · 11 comments
I enabled the cluster-autoscaler and changed the region to us-east-1, since the cluster would be running there. The call to the AWS Auto Scaling API times out, which causes cluster creation to fail (send request failed caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout). I am able to connect to the endpoint from the controller node, but for some reason the container cannot.
cluster.yaml cluster-autoscaler config
clusterAutoscaler:
  enabled: true
  replicas: 2
  # NOTE: I changed this value to reflect the region the cluster runs in
  region: "us-east-1"
  # image: "k8s.gcr.io/cluster-autoscaler:v1.13.4"
  resources:
    limits:
      cpu: 100m
      memory: 300Mi
    requests:
      cpu: 100m
      memory: 300Mi
  prometheusMetrics:
    enabled: true
    interval: "10s"
    namespace: monitoring
    selector:
      prometheus: monitoring
cluster-autoscaler docker container logs
I1120 15:53:29.453120 1 flags.go:52] FLAG: --address=":8085"
I1120 15:53:29.453148 1 flags.go:52] FLAG: --alsologtostderr="false"
I1120 15:53:29.453154 1 flags.go:52] FLAG: --balance-similar-node-groups="false"
I1120 15:53:29.453159 1 flags.go:52] FLAG: --cloud-config=""
I1120 15:53:29.453164 1 flags.go:52] FLAG: --cloud-provider="aws"
I1120 15:53:29.453169 1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I1120 15:53:29.453178 1 flags.go:52] FLAG: --cluster-name=""
I1120 15:53:29.453182 1 flags.go:52] FLAG: --cores-total="0:320000"
I1120 15:53:29.453187 1 flags.go:52] FLAG: --estimator="binpacking"
I1120 15:53:29.453192 1 flags.go:52] FLAG: --expander="least-waste"
I1120 15:53:29.453197 1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
I1120 15:53:29.453210 1 flags.go:52] FLAG: --gke-api-endpoint=""
I1120 15:53:29.453214 1 flags.go:52] FLAG: --gpu-total="[]"
I1120 15:53:29.453219 1 flags.go:52] FLAG: --httptest.serve=""
I1120 15:53:29.453224 1 flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
I1120 15:53:29.453229 1 flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
I1120 15:53:29.453234 1 flags.go:52] FLAG: --kubeconfig=""
I1120 15:53:29.453239 1 flags.go:52] FLAG: --kubernetes=""
I1120 15:53:29.453243 1 flags.go:52] FLAG: --leader-elect="true"
I1120 15:53:29.453255 1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I1120 15:53:29.453262 1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I1120 15:53:29.453268 1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I1120 15:53:29.453274 1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I1120 15:53:29.453279 1 flags.go:52] FLAG: --log-backtrace-at=":0"
I1120 15:53:29.453288 1 flags.go:52] FLAG: --log-dir=""
I1120 15:53:29.453294 1 flags.go:52] FLAG: --log-file=""
I1120 15:53:29.453298 1 flags.go:52] FLAG: --logtostderr="true"
I1120 15:53:29.453303 1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
I1120 15:53:29.453308 1 flags.go:52] FLAG: --max-empty-bulk-delete="10"
I1120 15:53:29.453313 1 flags.go:52] FLAG: --max-failing-time="15m0s"
I1120 15:53:29.453318 1 flags.go:52] FLAG: --max-graceful-termination-sec="600"
I1120 15:53:29.453323 1 flags.go:52] FLAG: --max-inactivity="10m0s"
I1120 15:53:29.453327 1 flags.go:52] FLAG: --max-node-provision-time="15m0s"
I1120 15:53:29.453332 1 flags.go:52] FLAG: --max-nodes-total="0"
I1120 15:53:29.453337 1 flags.go:52] FLAG: --max-total-unready-percentage="45"
I1120 15:53:29.453342 1 flags.go:52] FLAG: --memory-total="0:6400000"
I1120 15:53:29.453347 1 flags.go:52] FLAG: --min-replica-count="0"
I1120 15:53:29.453351 1 flags.go:52] FLAG: --namespace="kube-system"
I1120 15:53:29.453356 1 flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
I1120 15:53:29.453361 1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I1120 15:53:29.453366 1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/k8s-dev]"
I1120 15:53:29.453374 1 flags.go:52] FLAG: --nodes="[]"
I1120 15:53:29.453379 1 flags.go:52] FLAG: --ok-total-unready-count="3"
I1120 15:53:29.453384 1 flags.go:52] FLAG: --regional="false"
I1120 15:53:29.453389 1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I1120 15:53:29.453393 1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I1120 15:53:29.453399 1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"
I1120 15:53:29.453403 1 flags.go:52] FLAG: --scale-down-delay-after-delete="10s"
I1120 15:53:29.453408 1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
I1120 15:53:29.453413 1 flags.go:52] FLAG: --scale-down-enabled="true"
I1120 15:53:29.453418 1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"
I1120 15:53:29.453423 1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"
I1120 15:53:29.453428 1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"
I1120 15:53:29.453432 1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.5"
I1120 15:53:29.453437 1 flags.go:52] FLAG: --scan-interval="10s"
I1120 15:53:29.453442 1 flags.go:52] FLAG: --skip-headers="false"
I1120 15:53:29.453447 1 flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
I1120 15:53:29.453452 1 flags.go:52] FLAG: --skip-nodes-with-system-pods="false"
I1120 15:53:29.453456 1 flags.go:52] FLAG: --stderrthreshold="0"
I1120 15:53:29.453461 1 flags.go:52] FLAG: --test.bench=""
I1120 15:53:29.453466 1 flags.go:52] FLAG: --test.benchmem="false"
I1120 15:53:29.453471 1 flags.go:52] FLAG: --test.benchtime="1s"
I1120 15:53:29.453476 1 flags.go:52] FLAG: --test.blockprofile=""
I1120 15:53:29.453480 1 flags.go:52] FLAG: --test.blockprofilerate="1"
I1120 15:53:29.453485 1 flags.go:52] FLAG: --test.count="1"
I1120 15:53:29.453490 1 flags.go:52] FLAG: --test.coverprofile=""
I1120 15:53:29.453495 1 flags.go:52] FLAG: --test.cpu=""
I1120 15:53:29.453499 1 flags.go:52] FLAG: --test.cpuprofile=""
I1120 15:53:29.453504 1 flags.go:52] FLAG: --test.failfast="false"
I1120 15:53:29.453509 1 flags.go:52] FLAG: --test.list=""
I1120 15:53:29.453513 1 flags.go:52] FLAG: --test.memprofile=""
I1120 15:53:29.453518 1 flags.go:52] FLAG: --test.memprofilerate="0"
I1120 15:53:29.453522 1 flags.go:52] FLAG: --test.mutexprofile=""
I1120 15:53:29.453527 1 flags.go:52] FLAG: --test.mutexprofilefraction="1"
I1120 15:53:29.453532 1 flags.go:52] FLAG: --test.outputdir=""
I1120 15:53:29.453536 1 flags.go:52] FLAG: --test.parallel="2"
I1120 15:53:29.453541 1 flags.go:52] FLAG: --test.run=""
I1120 15:53:29.453546 1 flags.go:52] FLAG: --test.short="false"
I1120 15:53:29.453550 1 flags.go:52] FLAG: --test.testlogfile=""
I1120 15:53:29.453555 1 flags.go:52] FLAG: --test.timeout="0s"
I1120 15:53:29.453560 1 flags.go:52] FLAG: --test.trace=""
I1120 15:53:29.453565 1 flags.go:52] FLAG: --test.v="false"
I1120 15:53:29.453569 1 flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
I1120 15:53:29.453574 1 flags.go:52] FLAG: --v="4"
I1120 15:53:29.453579 1 flags.go:52] FLAG: --vmodule=""
I1120 15:53:29.453584 1 flags.go:52] FLAG: --write-status-configmap="true"
I1120 15:53:29.453593 1 main.go:333] Cluster Autoscaler 1.13.4
I1120 15:53:29.551968 1 leaderelection.go:205] attempting to acquire leader lease kube-system/cluster-autoscaler...
I1120 15:53:29.563459 1 leaderelection.go:214] successfully acquired lease kube-system/cluster-autoscaler
I1120 15:53:29.563999 1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"f195cf23-0b29-11ea-80ef-120e3f667f3d", APIVersion:"v1", ResourceVersion:"93198", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-8495648c6f-p2s7f became leader
I1120 15:53:29.565263 1 reflector.go:131] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:174
I1120 15:53:29.565295 1 reflector.go:169] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:174
I1120 15:53:29.565412 1 reflector.go:131] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:149
I1120 15:53:29.565428 1 reflector.go:169] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:149
I1120 15:53:29.565490 1 reflector.go:131] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I1120 15:53:29.565502 1 reflector.go:169] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I1120 15:53:29.565580 1 reflector.go:131] Starting reflector *v1beta1.DaemonSet (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:293
I1120 15:53:29.565591 1 reflector.go:169] Listing and watching *v1beta1.DaemonSet from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:293
I1120 15:53:29.565632 1 reflector.go:131] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:239
I1120 15:53:29.565643 1 reflector.go:169] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:239
I1120 15:53:29.565714 1 reflector.go:131] Starting reflector *v1beta1.PodDisruptionBudget (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:266
I1120 15:53:29.565723 1 reflector.go:169] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:266
I1120 15:53:29.753923 1 predicates.go:122] Using predicate PodFitsResources
I1120 15:53:29.754001 1 predicates.go:122] Using predicate GeneralPredicates
I1120 15:53:29.754024 1 predicates.go:122] Using predicate PodToleratesNodeTaints
I1120 15:53:29.754041 1 predicates.go:122] Using predicate CheckNodeUnschedulable
I1120 15:53:29.754057 1 predicates.go:122] Using predicate CheckVolumeBinding
I1120 15:53:29.754077 1 predicates.go:122] Using predicate MaxAzureDiskVolumeCount
I1120 15:53:29.754095 1 predicates.go:122] Using predicate MaxEBSVolumeCount
I1120 15:53:29.754112 1 predicates.go:122] Using predicate NoDiskConflict
I1120 15:53:29.754128 1 predicates.go:122] Using predicate NoVolumeZoneConflict
I1120 15:53:29.754163 1 predicates.go:122] Using predicate ready
I1120 15:53:29.754193 1 predicates.go:122] Using predicate MatchInterPodAffinity
I1120 15:53:29.754213 1 predicates.go:122] Using predicate MaxCSIVolumeCountPred
I1120 15:53:29.754229 1 predicates.go:122] Using predicate MaxGCEPDVolumeCount
I1120 15:53:29.754246 1 cloud_provider_builder.go:29] Building aws cloud provider.
I1120 15:53:29.849995 1 reflector.go:131] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.850175 1 reflector.go:169] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.850636 1 reflector.go:131] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.850740 1 reflector.go:169] Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.851101 1 reflector.go:131] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.851202 1 reflector.go:169] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.851585 1 reflector.go:131] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.851695 1 reflector.go:169] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.852053 1 reflector.go:131] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.852166 1 reflector.go:169] Listing and watching *v1.ReplicaSet from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.853043 1 reflector.go:131] Starting reflector *v1beta1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.853080 1 reflector.go:169] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.853607 1 reflector.go:131] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.853628 1 reflector.go:169] Listing and watching *v1.StorageClass from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.854295 1 reflector.go:131] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.854315 1 reflector.go:169] Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.950478 1 reflector.go:131] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.950513 1 reflector.go:169] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.950822 1 reflector.go:131] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:132
I1120 15:53:29.950849 1 reflector.go:169] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:132
I1120 15:53:30.050048 1 request.go:530] Throttling request took 99.060131ms, request: GET:https://10.32.0.1:443/api/v1/replicationcontrollers?limit=500&resourceVersion=0
E1120 15:55:30.264933 1 aws_manager.go:148] Failed to regenerate ASG cache: cannot autodiscover ASGs: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
F1120 15:55:30.264963 1 aws_cloud_provider.go:335] Failed to create AWS Manager: cannot autodiscover ASGs: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
curl -v https://autoscaling.us-east-1.amazonaws.com
master|k8s-dev core@ip-10-30-45-105 ~ $ curl -v https://autoscaling.us-east-1.amazonaws.com
* Trying 72.21.206.37:443...
...
> GET / HTTP/1.1
> Host: autoscaling.us-east-1.amazonaws.com
> User-Agent: curl/7.65.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 302 Found
< x-amzn-RequestId: 5ca19855-0bbd-11ea-8271-35f3f8399b12
< Location: http://aws.amazon.com/autoscaling
< Content-Length: 0
< Date: Wed, 20 Nov 2019 17:44:10 GMT
<
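Since curl succeeds from the controller node, it may be worth repeating the same request from inside the pod network; if it also times out there, the problem is likely pod-level networking or DNS rather than the AWS endpoint itself. A rough sketch (the pod names and images are arbitrary choices, not anything kube-aws ships):

```shell
# Hit the Auto Scaling endpoint from a throwaway pod on the pod network
kubectl run net-test --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sv --max-time 10 https://autoscaling.us-east-1.amazonaws.com/

# Also check that the name resolves from inside a pod
kubectl run dns-test --rm -it --restart=Never \
  --image=busybox --command -- \
  nslookup autoscaling.us-east-1.amazonaws.com
```

If DNS fails but the node's does not, the usual suspects are kube-dns/CoreDNS health or the overlay network rather than AWS.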
To create the stack, I had to disable the cluster autoscaler. Then I edited plugins/cluster-autoscaler/servicemonitor.yaml (outdenting line 9), re-enabled the cluster autoscaler in cluster.yaml, ran kube-aws apply, and ran into a new error:
Nov 23 21:57:29 ip-10-30-45-159.ec2.internal retry[25602]: error: unable to recognize "/srv/kube-aws/plugins/cluster-autoscaler/servicemonitor.yaml": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
@flah00 if you're not using the Prometheus Operator, disable prometheusMetrics.enabled in the autoscaler plugin, as Kubernetes will not recognise the ServiceMonitor custom resource without the operator's CRDs installed. This should fix the issue.
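For reference, that change in cluster.yaml would look along these lines (the rest of the clusterAutoscaler section stays as posted above):

```yaml
clusterAutoscaler:
  enabled: true
  prometheusMetrics:
    # skip creating the ServiceMonitor when prometheus-operator is not installed
    enabled: false
```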
I disabled the Prometheus metrics, which got me past the previous error. But I'm now running into an issue where taints are preventing one of the two cluster-autoscaler pods from starting.
kubectl describe po -n kube-system cluster-autoscaler-....
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node.alpha.kubernetes.io/role=master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5m22s (x389 over 9h) default-scheduler 0/3 nodes are available: 1 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules.
Normal NotTriggerScaleUp 5m21s (x3317 over 9h) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector
Normal NotTriggerScaleUp 10s cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector
I don't see a way to add tolerations to the cluster-autoscaler atm. If you don't have a node pool without taints, it will be difficult for certain control-plane resources to find a place to run, as some of them are excluded from the masters. Try removing the taints or adding a node pool without them.
A PR to support this in the cluster-autoscaler plugin should be easy to do. I'll try to find some time.
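Until the plugin supports tolerations, the workaround is on the node side. A sketch, with <node-name> as a placeholder (the trailing - on the taint key removes it):

```shell
# Inspect the taints currently set on a node
kubectl describe node <node-name> | grep -A2 Taints

# Remove the master NoSchedule taint from that node
kubectl taint nodes <node-name> node.alpha.kubernetes.io/role=master:NoSchedule-
```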
Maybe I'm reading the taints incorrectly, but it seems to me that the autoscaler is configured to run only on controller instances. The plugin also sets the replica count to two. One of the autoscaler pods is running on a master, but for some reason the second pod is unable to run on the second master.
kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-10-30-45-243.ec2.internal Ready master 4d21h v1.14.8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1e,kube-aws.coreos.com/role=controller,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-30-45-243.ec2.internal,kubernetes.io/os=linux,kubernetes.io/role=master,node-role.kubernetes.io/master=,node.kubernetes.io/role=master,service-cidr=10.32.0.0_24
ip-10-30-45-65.ec2.internal Ready master 12d v1.14.8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1e,kube-aws.coreos.com/role=controller,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-30-45-65.ec2.internal,kubernetes.io/os=linux,kubernetes.io/role=master,node-role.kubernetes.io/master=,node.kubernetes.io/role=master,service-cidr=10.32.0.0_24
kubectl get po -n kube-system -o wide | grep -E 'autosc|[A-Z][A-Z]'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cluster-autoscaler-7dd84c664c-k7g85 0/1 Pending 0 3d21h <none> <none> <none> <none>
cluster-autoscaler-7dd84c664c-lrgsb 1/1 Running 1 12d 10.33.8.5 ip-10-30-45-243.ec2.internal <none> <none>
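Both masters carry identical labels, so the scheduler message points at the pod spec rather than the nodes. To see exactly which nodeSelector and anti-affinity rules the deployment carries, something like this should work (the deployment name is assumed from the pod names above):

```shell
kubectl -n kube-system get deploy cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}{.spec.template.spec.affinity}'
```

If the anti-affinity topology key is something other than hostname (e.g. a zone label), two masters in the same zone would explain why the second replica stays Pending.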
One last thing I noticed: we're running Kubernetes 1.14.8, but the plugin defaults to a 1.13.x image. I believe the autoscaler docs encourage users to match the autoscaler version to the cluster's Kubernetes version, so I think we should default to cluster-autoscaler:v1.14.6 as the image.
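Since the plugin already exposes the image field (commented out in the config above), pinning a matching version should just be a cluster.yaml change (v1.14.6 assumed to be the current 1.14.x release at the time):

```yaml
clusterAutoscaler:
  enabled: true
  # match the autoscaler minor version to the cluster's Kubernetes version (1.14.x)
  image: "k8s.gcr.io/cluster-autoscaler:v1.14.6"
```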
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.