Cluster Autoscaler unable to scale up nodes from EKS autoscaling group warm pool when pods request ephemeral storage
David-Tamrazov opened this issue · 26 comments
Is this a BUG REPORT or FEATURE REQUEST?:
A bug report.
What happened:
Cluster Autoscaler is unable to scale up nodes from warm pools of an InstanceGroup with the eks provisioner when the unschedulable pods request additional ephemeral storage:
I0215 16:21:32.873817 1 klogx.go:86] Pod nodegroup-runners/iris-kdr5j-mhntj is unschedulable
I0215 16:21:32.873855 1 scale_up.go:376] Upcoming 1 nodes
I0215 16:21:32.873910 1 scale_up.go:300] Pod iris-kdr5j-mhntj can't be scheduled on eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0215 16:21:32.873939 1 scale_up.go:449] No pod can fit to eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7
I0215 16:21:32.873989 1 scale_up.go:300] Pod iris-kdr5j-mhntj can't be scheduled on tm-github-runners-cluster-nodegroup-runners-iris, predicate checking error: Insufficient ephemeral-storage; predicateName=NodeResourcesFit; reasons: Insufficient ephemeral-storage; debugInfo=
I0215 16:21:32.874023 1 scale_up.go:449] No pod can fit to tm-github-runners-cluster-nodegroup-runners-iris
I0215 16:21:32.874050 1 scale_up.go:453] No expansion options
This is the configured InstanceGroup
in question:
apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  name: iris
  namespace: nodegroup-runners
  annotations:
    instancemgr.keikoproj.io/cluster-autoscaler-enabled: 'true'
spec:
  strategy:
    type: rollingUpdate
    rollingUpdate:
      maxUnavailable: 5
  provisioner: eks
  eks:
    minSize: 1
    maxSize: 10
    warmPool:
      minSize: 0
      maxSize: 10
    configuration:
      labels:
        workload: runners
      keyPairName: iris-github-runner-keypair
      clusterName: tm-github-runners-cluster
      image: ami-0778893a848813e52
      instanceType: c6i.2xlarge
I'm attempting to deploy pods via this RunnerDeployment
from the actions-runner-controller. If I remove the ephemeral-storage
requests & limits, then Cluster Autoscaler is able to scale up nodes from the warm pool as expected.
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: iris
  namespace: nodegroup-runners
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload
                operator: In
                values:
                - runners
      ephemeral: true
      repository: MyOrg/iris
      labels:
      - self-hosted
      dockerEnabled: false
      image: ghcr.io/my-org/self-hosted-runners/iris:v7
      imagePullSecrets:
      - name: github-container-registry
      containers:
      - name: runner
        imagePullPolicy: IfNotPresent
        env:
        - name: RUNNER_FEATURE_FLAG_EPHEMERAL
          value: "true"
        resources:
          requests:
            cpu: "1.0"
            memory: "2Gi"
            ephemeral-storage: "10Gi"
          limits:
            cpu: "2.0"
            memory: "4Gi"
            ephemeral-storage: "10Gi"
What you expected to happen:
For Cluster Autoscaler to be able to scale up nodes from the warm pool when there are unschedulable pods that request ephemeral storage.
How to reproduce it (as minimally and precisely as possible):
- Deploy the above instance group (using your own subnet, cluster, and security groups) into a cluster with Cluster Autoscaler in it
- Deploy any pods with the appropriate node affinity and requests/limits for ephemeral storage (see the minimal example after this list)
- Check the cluster autoscaler logs and note the failure to scale up
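If you don't want to pull in the actions-runner-controller, a bare pod along these lines should hit the same code path (an illustrative sketch; the pod name and image are placeholders, and the workload: runners label comes from the instance group above):

apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-storage-repro   # placeholder name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload          # must match the label set on the instance group
            operator: In
            values:
            - runners
  containers:
  - name: sleep
    image: busybox:1.35            # any small image works
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
        ephemeral-storage: "10Gi"  # this request is what breaks scale-up from the warm pool
      limits:
        ephemeral-storage: "10Gi"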
Environment:
- Kubernetes version: 1.21
$ kubectl version -o yaml
clientVersion:
buildDate: "2021-08-19T15:45:37Z"
compiler: gc
gitCommit: 632ed300f2c34f6d6d15ca4cef3d3c7073412212
gitTreeState: clean
gitVersion: v1.22.1
goVersion: go1.16.7
major: "1"
minor: "22"
platform: darwin/amd64
serverVersion:
buildDate: "2021-10-29T23:32:16Z"
compiler: gc
gitCommit: 5236faf39f1b7a7dabea8df12726f25608131aa9
gitTreeState: clean
gitVersion: v1.21.5-eks-bc4871b
goVersion: go1.16.8
major: "1"
minor: 21+
platform: linux/amd64
Other debugging information (if applicable):
None available, as we've deprovisioned our test setup for now, but if need be we can reproduce and post additional logs here.
Thanks for filing this, although this seems to be a cluster-autoscaler issue. Does it work without warm pools, e.g. generic autoscaling based on ephemeral storage?
I'm not sure if there is something we can do on the instance-manager side to support this.
I think this is specifically a 'scale from zero' problem - where CA uses only the tags on the ASG to determine the capacity of a potential new node. I believe this is possible to implement in Instance-Manager - we'd have to inspect the volumes attached to the IG, determine which volume is used for ephemeral storage, and tag the ASG accordingly.
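For reference, when scaling from zero (or from a stopped warm pool) Cluster Autoscaler on AWS can only infer extra node resources and labels from node-template tags on the ASG, along these lines (illustrative values - they have to match the real nodes):

k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: 10Gi
k8s.io/cluster-autoscaler/node-template/label/workload: runners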
@backjo definitely could be.
@David-Tamrazov could you confirm if this is the case? If min is not 0, does it work?
Let me give it a spin! I did indeed run into that issue before, hence the min: 1
setting on the instance group itself, but it didn't occur to me that min: 0
on the warm pool might be a problem.
Re: this being a cluster autoscaler problem, it fully could be. I only figured to post here since I was able to get the Cluster Autoscaler to pull in nodes from an EKS Managed Nodegroup provisioned through CDK for the same pods without issue, so I figured there might be something missing in the InstanceGroup setup.
I'll try the following and report back:
- autoscaling for pods that need ephemeral storage without warm pools
- autoscaling for pods that need ephemeral storage with warm pools and minSize set to 1 on the warm pool
Would also be good to experiment with tagging the ASG via configuration.tags with k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: "xGi" (replace x with the volume size on your node).
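e.g. something along these lines in the IG spec (a sketch - the value has to reflect your actual node volume):

spec:
  eks:
    configuration:
      tags:
        - key: k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage
          value: "10Gi"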
Adding the tag definitely got the autoscaler to register the autoscaling group as a suitable candidate, great tip:
I0215 22:23:04.427911 1 static_autoscaler.go:319] 2 unregistered nodes present
I0215 22:23:04.428042 1 filter_out_schedulable.go:65] Filtering out schedulables
I0215 22:23:04.428076 1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0215 22:23:04.428123 1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-4v89d-registration-only marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-8439533899060027234-upcoming-0. Ignoring in scale up.
I0215 22:23:04.428168 1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-4v89d-4gncr marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-8439533899060027234-upcoming-1. Ignoring in scale up.
I0215 22:23:04.428185 1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0215 22:23:04.428193 1 filter_out_schedulable.go:171] 2 pods marked as unschedulable can be scheduled.
I0215 22:23:04.428204 1 filter_out_schedulable.go:79] Schedulable pods present
I0215 22:23:04.428228 1 static_autoscaler.go:401] No unschedulable pods
However, the pod just sits there unscheduled now; I think something is happening with the kube-scheduler that's preventing it from placing the Pod on the node:
$ kubectl describe pod iris-4v89d-registration-only -n nodegroup-runners
Name: iris-4v89d-registration-only
Namespace: nodegroup-runners
Priority: 0
Node: <none>
Labels: pod-template-hash=749fd4569f
Annotations: actions-runner-controller/registration-only: true
kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Controlled By: Runner/iris-4v89d-registration-only
Containers:
runner:
Image: ghcr.io/myorg/self-hosted-runners/iris:v7
Port: <none>
Host Port: <none>
Limits:
cpu: 2
ephemeral-storage: 10Gi
memory: 4Gi
Requests:
cpu: 1
ephemeral-storage: 10Gi
memory: 2Gi
Environment:
RUNNER_FEATURE_FLAG_EPHEMERAL: true
RUNNER_ORG:
RUNNER_REPO: MyOrg/iris
RUNNER_ENTERPRISE:
RUNNER_LABELS: self-hosted
RUNNER_GROUP:
DOCKERD_IN_RUNNER: false
GITHUB_URL: https://github.com/
RUNNER_WORKDIR: /runner/_work
RUNNER_EPHEMERAL: true
RUNNER_REGISTRATION_ONLY: true
RUNNER_NAME: iris-4v89d-registration-only
RUNNER_TOKEN: <--REDACTED-->
Mounts:
/runner from runner (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bz7lp (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
runner:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-bz7lp:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 52s (x6 over 4m56s) default-scheduler 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.
If I run kubectl get nodes
I see that the nodes from the autoscaling group don't appear in the list, so I'm not sure if the scheduler is even aware that these nodes exist. I even see 2 unregistered nodes present
in the autoscaler logs which could be hinting at the same issue:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-1-19-186.ec2.internal Ready <none> 4d2h v1.21.5-eks-9017834
ip-10-1-4-225.ec2.internal Ready <none> 4d2h v1.21.5-eks-9017834
^ might be hard to see, but you can tell the nodes listed by kubectl get nodes
aren't the same ones from the autoscaling group by their names and age (the other 2 nodes are 4 days old; the new ones were just created).
Nothing from instance-manager
logs jumps out at me as problematic:
2022-02-15T22:31:08.609Z INFO v1alpha1 state transition occured {"instancegroup": "nodegroup-runners/iris", "state": "ReconcileModifying", "previousState": "InitUpdate"}
2022-02-15T22:31:08.609Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "iam", "operation": "GetRole"}
2022-02-15T22:31:08.609Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "iam", "operation": "GetInstanceProfile"}
2022-02-15T22:31:08.610Z INFO controllers.instancegroup.eks updated managed policies {"instancegroup": "nodegroup-runners/iris", "iamrole": "tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:08.610Z INFO controllers.instancegroup.eks reconciled managed role {"instancegroup": "nodegroup-runners/iris", "iamrole": "tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:08.610Z INFO scaling drift not detected {"instancegroup": "nodegroup-runners/iris"}
2022-02-15T22:31:08.610Z INFO controllers.instancegroup.eks bootstrapping arn to aws-auth {"instancegroup": "nodegroup-runners/iris", "arn": "arn:aws:iam::<--redacted-->:role/tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:08.628Z INFO controllers.instancegroup.eks waiting for node readiness conditions {"instancegroup": "nodegroup-runners/iris"}
2022-02-15T22:31:08.629Z INFO controllers.instancegroup.eks desired nodes are not ready {"instancegroup": "nodegroup-runners/iris", "instances": "i-0616a2e0e0d595ddc,i-09d55075b02cbc6b6"}
2022-02-15T22:31:08.629Z INFO controllers.instancegroup reconcile event ended with requeue {"instancegroup": "nodegroup-runners/iris", "provisioner": "eks"}
2022-02-15T22:31:08.630Z INFO controllers.instancegroup patching resource status {"instancegroup": "nodegroup-runners/iris", "patch": "{}", "resourceVersion": "5028928"}
2022-02-15T22:31:18.641Z INFO controllers.instancegroup reconcile event started {"instancegroup": "nodegroup-runners/iris", "provisioner": "eks"}
2022-02-15T22:31:18.642Z INFO v1alpha1 state transition occured {"instancegroup": "nodegroup-runners/iris", "state": "Init", "previousState": "ReconcileModifying"}
2022-02-15T22:31:18.642Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "autoscaling", "operation": "DescribeLaunchConfigurations"}
2022-02-15T22:31:18.731Z DEBUG aws-provider AWS API call {"cacheHit": false, "service": "iam", "operation": "GetRole"}
2022-02-15T22:31:18.771Z DEBUG aws-provider AWS API call {"cacheHit": false, "service": "iam", "operation": "ListAttachedRolePolicies"}
2022-02-15T22:31:18.816Z DEBUG aws-provider AWS API call {"cacheHit": false, "service": "iam", "operation": "GetInstanceProfile"}
2022-02-15T22:31:18.818Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "autoscaling", "operation": "DescribeAutoScalingGroups"}
2022-02-15T22:31:18.819Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "eks", "operation": "DescribeCluster"}
2022-02-15T22:31:18.870Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:18.926Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:18.979Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:19.062Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:19.115Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:19.122Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "autoscaling", "operation": "DescribeLifecycleHooks"}
2022-02-15T22:31:19.123Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "autoscaling", "operation": "DescribeLaunchConfigurations"}
2022-02-15T22:31:19.123Z INFO v1alpha1 state transition occured {"instancegroup": "nodegroup-runners/iris", "state": "InitUpdate", "previousState": "Init"}
2022-02-15T22:31:19.123Z INFO v1alpha1 state transition occured {"instancegroup": "nodegroup-runners/iris", "state": "ReconcileModifying", "previousState": "InitUpdate"}
2022-02-15T22:31:19.123Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "iam", "operation": "GetRole"}
2022-02-15T22:31:19.124Z DEBUG aws-provider AWS API call {"cacheHit": true, "service": "iam", "operation": "GetInstanceProfile"}
2022-02-15T22:31:19.124Z INFO controllers.instancegroup.eks updated managed policies {"instancegroup": "nodegroup-runners/iris", "iamrole": "tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:19.124Z INFO controllers.instancegroup.eks reconciled managed role {"instancegroup": "nodegroup-runners/iris", "iamrole": "tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:19.124Z INFO scaling drift not detected {"instancegroup": "nodegroup-runners/iris"}
2022-02-15T22:31:19.124Z INFO controllers.instancegroup.eks bootstrapping arn to aws-auth {"instancegroup": "nodegroup-runners/iris", "arn": "arn:aws:iam::<--redacted-->:role/tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:19.143Z INFO controllers.instancegroup.eks waiting for node readiness conditions {"instancegroup": "nodegroup-runners/iris"}
2022-02-15T22:31:19.143Z INFO controllers.instancegroup.eks desired nodes are not ready {"instancegroup": "nodegroup-runners/iris", "instances": "i-0616a2e0e0d595ddc,i-09d55075b02cbc6b6"}
2022-02-15T22:31:19.143Z INFO controllers.instancegroup reconcile event ended with requeue {"instancegroup": "nodegroup-runners/iris", "provisioner": "eks"}
2022-02-15T22:31:19.143Z INFO controllers.instancegroup patching resource status {"instancegroup": "nodegroup-runners/iris", "patch": "{}", "resourceVersion": "5028928"}
Great, if nodes scale out now, we have an approach for adding autoscaler support as part of the cluster-autoscaler-enabled annotation. (thanks @backjo!)
If you now have a scheduling issue, can you compare the affinity to make sure it's matching, e.g. the pod's node affinity vs. the node labels?
The pod describe you added shows Node-Selectors: <none>, so maybe something is not going through there.
I think that's because I'm not explicitly defining a nodeSelector on the pod; in other contexts (i.e. with managed nodegroups via CDK), the affinity works as expected between my nodes and the RunnerDeployment described above. I believe it matches here as well, since the autoscaler sees it doesn't match the other nodes in my cluster, but it does consider the pod affinity to match the AutoScalingGroup from the InstanceGroup.
No match with other nodes in the cluster:
I0215 22:03:53.173980 1 klogx.go:86] Pod nodegroup-runners/iris-4v89d-4gncr is unschedulable
I0215 22:03:53.174135 1 scale_up.go:376] Upcoming 1 nodes
I0215 22:03:53.174197 1 scale_up.go:300] Pod iris-4v89d-4gncr can't be scheduled on eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0215 22:03:53.174227 1 scale_up.go:449] No pod can fit to eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7
Successful match for the template nodes from the autoscaling group warmpool:
I0215 22:23:04.427911 1 static_autoscaler.go:319] 2 unregistered nodes present
I0215 22:23:04.428042 1 filter_out_schedulable.go:65] Filtering out schedulables
I0215 22:23:04.428076 1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0215 22:23:04.428123 1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-4v89d-registration-only marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-8439533899060027234-upcoming-0. Ignoring in scale up.
I0215 22:23:04.428168 1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-4v89d-4gncr marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-8439533899060027234-upcoming-1. Ignoring in scale up.
I can also see the necessary tags on the ASG:
I can't really check the nodes because they're not accessible via the kube API; it seems to be unaware of them, and I don't think they're registered with the cluster.
Can you confirm you see the affinity on the pod spec for the pods created by the RunnerDeployment?
I think the ASG/Cluster Autoscaler part is fine - nodes come up when pods are pending, but for some reason the pods cannot schedule once a node comes up. So in that case either the label is not on the node (which it seems to be) or the affinity is not on the pod?
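A quick way to check both sides would be something like:

kubectl get nodes -L workload                    # is the label on the nodes?
kubectl get pod iris-4v89d-registration-only -n nodegroup-runners -o jsonpath='{.spec.affinity.nodeAffinity}'   # is the affinity on the pod?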
Yep, I can confirm the affinity is on the pod spec. I'm just thinking: how come there wouldn't be logs for the missing node label if the scheduler found the autoscaling nodes and saw no labels present? The current 0/2 nodes are available refers to the managed nodegroup I deployed with the cluster to run system controllers, including instance-manager. If the scheduler saw the ASG nodes and found the label missing, then the log ought to read 0/4 nodes are available:
$ kubectl get pod iris-4v89d-registration-only -oyaml -n nodegroup-runners
apiVersion: v1
kind: Pod
metadata:
annotations:
actions-runner-controller/registration-only: "true"
kubernetes.io/psp: eks.privileged
creationTimestamp: "2022-02-15T22:51:34Z"
labels:
pod-template-hash: 749fd4569f
name: iris-4v89d-registration-only
namespace: nodegroup-runners
ownerReferences:
- apiVersion: actions.summerwind.dev/v1alpha1
blockOwnerDeletion: true
controller: true
kind: Runner
name: iris-4v89d-registration-only
uid: 7a0c2b6e-51d6-42cd-8dd4-47d972440550
resourceVersion: "5041103"
uid: 67e81669-576f-4f09-b09a-326824750ae9
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: workload
operator: In
values:
- runners
containers:
- env:
- name: RUNNER_FEATURE_FLAG_EPHEMERAL
value: "true"
- name: RUNNER_ORG
- name: RUNNER_REPO
value: MyOrg/iris
- name: RUNNER_ENTERPRISE
- name: RUNNER_LABELS
value: self-hosted
- name: RUNNER_GROUP
- name: DOCKERD_IN_RUNNER
value: "false"
- name: GITHUB_URL
value: https://github.com/
- name: RUNNER_WORKDIR
value: /runner/_work
- name: RUNNER_EPHEMERAL
value: "true"
- name: RUNNER_REGISTRATION_ONLY
value: "true"
- name: RUNNER_NAME
value: iris-4v89d-registration-only
- name: RUNNER_TOKEN
value: <--Redacted-->
image: ghcr.io/myorg/self-hosted-runners/iris:v7
imagePullPolicy: IfNotPresent
name: runner
resources:
limits:
cpu: "2"
ephemeral-storage: 10Gi
memory: 4Gi
requests:
cpu: "1"
ephemeral-storage: 10Gi
memory: 2Gi
securityContext:
privileged: false
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /runner
name: runner
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-f8gnv
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
imagePullSecrets:
- name: github-container-registry
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: OnFailure
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- emptyDir: {}
name: runner
- name: kube-api-access-f8gnv
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2022-02-15T22:51:34Z"
message: '0/2 nodes are available: 2 node(s) didn''t match Pod''s node affinity/selector.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable
One possibility is that this is still a Cluster Autoscaler issue and it's not bringing in nodes from warm pools correctly; it spins them up but then doesn't register them with the cluster for some reason. Could be wrong though.
The warm pool instances are not supposed to be part of the cluster; they are spun up and shut down while in the warm pool. Later, when autoscaling happens, instead of spinning up new instances, instances from the warm pool simply move to the 'live' pool of instances and are powered on. The 'faster' scaling here is because we only spend time waiting for the instance to boot instead of boot + provision.
As long as you see the instances moving from the warm pool to the main pool when scaling happens, this part should be fine. You can also test the same thing without warm pools to exclude it.
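You can sanity-check the warm pool vs. live split from the CLI, e.g. (a sketch, using the ASG name from your IG status):

aws autoscaling describe-warm-pool --auto-scaling-group-name tm-github-runners-cluster-nodegroup-runners-iris
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names tm-github-runners-cluster-nodegroup-runners-iris \
  --query 'AutoScalingGroups[0].Instances[].[InstanceId,LifecycleState]'

describe-warm-pool lists the warmed instances (states like Warmed:Stopped), while the ASG's own instance list should show the live ones as InService.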
Went over the pictures you added again - are you saying the instances that are spun up in the live pool are not joining the cluster? The instances in the live pool should definitely join the cluster, so there could be another issue here
Yeah, none of the instances from the InstanceGroup are part of the cluster or appear when I run kubectl get nodes, nor is the one live instance I spin up by default with minSize: 1 in the instance group configuration (not the warm pool configuration) able to have a pod scheduled onto it.
I'm currently redeploying without a warm pool to rule out the warm pool itself.
So I removed the warm pool and deployed only the InstanceGroup; the live node is still not listed when I run kubectl get nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-1-19-186.ec2.internal Ready <none> 4d3h v1.21.5-eks-9017834
ip-10-1-4-225.ec2.internal Ready <none> 4d3h v1.21.5-eks-9017834
I see the instance in the ASG running and healthy. Screenshot from the ASG console:
EKS console screenshot:
InstanceGroup spec:
$ kubectl describe instancegroup iris -n nodegroup-runners
Name: iris
Namespace: nodegroup-runners
Labels: <none>
Annotations: instancemgr.keikoproj.io/cluster-autoscaler-enabled: true
API Version: instancemgr.keikoproj.io/v1alpha1
Kind: InstanceGroup
Metadata:
Creation Timestamp: 2022-02-15T23:27:09Z
Finalizers:
finalizer.instancegroups.keikoproj.io
Generation: 1
Managed Fields:
API Version: instancemgr.keikoproj.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:instancemgr.keikoproj.io/cluster-autoscaler-enabled:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
.:
f:eks:
.:
f:configuration:
.:
f:clusterName:
f:image:
f:instanceType:
f:keyPairName:
f:labels:
.:
f:workload:
f:securityGroups:
f:subnets:
f:tags:
f:maxSize:
f:minSize:
f:provisioner:
f:strategy:
.:
f:rollingUpdate:
.:
f:maxUnavailable:
f:type:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2022-02-15T23:27:09Z
API Version: instancemgr.keikoproj.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"finalizer.instancegroups.keikoproj.io":
f:status:
.:
f:activeLaunchConfigurationName:
f:activeScalingGroupName:
f:conditions:
f:currentMax:
f:currentMin:
f:currentState:
f:lifecycle:
f:nodesInstanceRoleArn:
f:provisioner:
f:strategy:
Manager: manager
Operation: Update
Time: 2022-02-15T23:28:49Z
Resource Version: 5053640
UID: b3b8c7ac-004e-4ffb-ad5e-6b2b94a8e183
Spec:
Eks:
Configuration:
Cluster Name: tm-github-runners-cluster
Image: ami-0778893a848813e52
Instance Type: c6i.2xlarge
Key Pair Name: iris-github-runner-keypair
Labels:
Workload: runners
Security Groups:
sg-0e8839d84fa962a0e
Subnets:
subnet-fad141a3
subnet-01da7c2d4ac7c3da7
subnet-fdd141a4
subnet-03400f942826b18bb
Tags:
Key: k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage
Value: 10Gi
Max Size: 10
Min Size: 1
Provisioner: eks
Strategy:
Rolling Update:
Max Unavailable: 5
Type: rollingUpdate
Status:
Active Launch Configuration Name: tm-github-runners-cluster-nodegroup-runners-iris-20220215232745
Active Scaling Group Name: tm-github-runners-cluster-nodegroup-runners-iris
Conditions:
Status: False
Type: NodesReady
Current Max: 10
Current Min: 1
Current State: ReconcileModifying
Lifecycle: normal
Nodes Instance Role Arn: arn:aws:iam::<--redacted-->:role/tm-github-runners-cluster-nodegroup-runners-iris
Provisioner: eks
Strategy: rollingUpdate
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal InstanceGroupCreated 5m17s {"instancegroup":"nodegroup-runners/iris","msg":"instance group has been successfully created","scalinggroup":"tm-github-runners-cluster-nodegroup-runners-iris"}
Autoscaler logs - I'm seeing 1 unregistered nodes present
now that there's no warm pool instance scaled up. I'm fairly convinced it's talking about the 1 live InstanceGroup
instance we have now - this same log read 2 unregistered nodes present
with the 1 live instance + 1 warm pool instance before:
I0215 23:37:42.277292 1 static_autoscaler.go:228] Starting main loop
I0215 23:37:42.373011 1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: [eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7 tm-github-runners-cluster-nodegroup-runners-iris]
I0215 23:37:42.425687 1 auto_scaling.go:199] 2 launch configurations already in cache
I0215 23:37:42.425853 1 aws_manager.go:269] Refreshed ASG list, next refresh after 2022-02-15 23:38:42.42584869 +0000 UTC m=+89744.527237746
I0215 23:37:42.426168 1 aws_manager.go:315] Found multiple availability zones for ASG "tm-github-runners-cluster-nodegroup-runners-iris"; using us-east-1a for failure-domain.beta.kubernetes.io/zone label
I0215 23:37:42.426890 1 static_autoscaler.go:319] 1 unregistered nodes present
I0215 23:37:42.426971 1 filter_out_schedulable.go:65] Filtering out schedulables
I0215 23:37:42.426982 1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0215 23:37:42.427015 1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-ksvwg-registration-only marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-3295993556649179818-upcoming-0. Ignoring in scale up.
I0215 23:37:42.427023 1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0215 23:37:42.427051 1 filter_out_schedulable.go:171] 1 pods marked as unschedulable can be scheduled.
I0215 23:37:42.427058 1 filter_out_schedulable.go:79] Schedulable pods present
I0215 23:37:42.427077 1 static_autoscaler.go:401] No unschedulable pods
I0215 23:37:42.427091 1 static_autoscaler.go:448] Calculating unneeded nodes
I0215 23:37:42.427104 1 pre_filtering_processor.go:66] Skipping ip-10-1-4-225.ec2.internal - node group min size reached
I0215 23:37:42.427111 1 pre_filtering_processor.go:66] Skipping ip-10-1-19-186.ec2.internal - node group min size reached
I0215 23:37:42.427151 1 static_autoscaler.go:502] Scale down status: unneededOnly=true lastScaleUpTime=2022-02-15 23:23:16.738060926 +0000 UTC m=+88818.839449992 lastScaleDownDeleteTime=2022-02-15 20:51:39.433862308 +0000 UTC m=+79721.535251364 lastScaleDownFailTime=2022-02-14 22:43:18.998809906 +0000 UTC m=+21.100198962 scaleDownForbidden=true isDeleteInProgress=false scaleDownInCooldown=true
The autoscaler also shouldn't be trying to scale up new nodes with just the 1 pod needing scheduling; that 1 pod should fit onto the 1 c6i.2xlarge
instance that is provisioned by the InstanceGroup
:
Pod desc:
$ kubectl describe pod iris-ksvwg-registration-only -n nodegroup-runners
Name: iris-ksvwg-registration-only
Namespace: nodegroup-runners
Priority: 0
Node: <none>
Labels: pod-template-hash=749fd4569f
Annotations: actions-runner-controller/registration-only: true
kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Controlled By: Runner/iris-ksvwg-registration-only
Containers:
runner:
Image: ghcr.io/myorg/self-hosted-runners/iris:v7
Port: <none>
Host Port: <none>
Limits:
cpu: 2
ephemeral-storage: 10Gi
memory: 4Gi
Requests:
cpu: 1
ephemeral-storage: 10Gi
memory: 2Gi
Environment:
RUNNER_FEATURE_FLAG_EPHEMERAL: true
RUNNER_ORG:
RUNNER_REPO: MyOrg/iris
RUNNER_ENTERPRISE:
RUNNER_LABELS: self-hosted
RUNNER_GROUP:
DOCKERD_IN_RUNNER: false
GITHUB_URL: https://github.com/
RUNNER_WORKDIR: /runner/_work
RUNNER_EPHEMERAL: true
RUNNER_REGISTRATION_ONLY: true
RUNNER_NAME: iris-ksvwg-registration-only
RUNNER_TOKEN: <--REDACTED-->
Mounts:
/runner from runner (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6l7rd (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
runner:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-6l7rd:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20s (x5 over 4m3s) default-scheduler 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.
In this case you need to understand why the nodes are not joining the cluster; the instance-manager controller was also showing this in the logs:
controllers.instancegroup.eks desired nodes are not ready {"instancegroup": "nodegroup-runners/iris", "instances": "i-0616a2e0e0d595ddc,i-09d55075b02cbc6b6"}
I would look at the node's kubelet logs and the control plane logs to see why they are not joining (you can probably find the control plane logs in CloudWatch Logs, and get the kubelet logs by SSHing to the nodes).
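e.g. on the instance itself (SSH/SSM) and from the cluster side, something like:

# on the node - why is the kubelet not registering?
journalctl -u kubelet --no-pager | tail -n 100
# from the cluster - is the node IAM role mapped so the kubelet can authenticate?
kubectl get configmap aws-auth -n kube-system -o yaml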
Great sounds good, thanks for the pointer. I'll dig through cloudwatch and let you know what I find.
@David-Tamrazov any success with this?
@backjo Do we want to implement some fix for this?
I think this would be slightly problematic considering you don't know which volume the application will use - e.g. someone could be attaching EBS volumes or using the instance store, so the storage size would be dynamic from the controller's perspective.
The workaround of adding a tag to the IG spec is actually pretty good:
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: ?
We could have an annotation to add this tag, but that might be a bit redundant.
Any thoughts?
@eytan-avisror unfortunately not yet, I got pulled aside on a different issue at work and haven't been able to dedicate more time to this. I'm hoping to set up CloudWatch logs for the node kubelets this week to get an idea of what might be happening. My suspicion is that I'm setting up the security group for the instances incorrectly, so the nodes can't actually communicate with the API server.
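One concrete thing I plan to compare (a sketch, assuming AWS CLI access) is my custom IG security group against the security group EKS created for the cluster:

aws eks describe-cluster --name tm-github-runners-cluster \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId'
aws ec2 describe-security-groups --group-ids sg-0e8839d84fa962a0e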
@backjo Do we want to implement some fix for this? I think this would be slightly problematic considering you don't know which volume the application will use - e.g. someone could be attaching EBS volumes or using the instance store, so the storage size would be dynamic from the controller's perspective. The workaround of adding a tag to the IG spec is actually pretty good:
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: ?
We could have an annotation to add this tag, but that might be a bit redundant.
Any thoughts?
I'm not sure that we'll be able to be correct enough to implement this well. As you mention, we generally will not know exactly which volume the Node has configured for ephemeral storage. I think the workaround here is suitable for folks. Maybe we could update the documentation around the cluster autoscaler annotation, though, to help awareness of the limitation here.
I wanted to post back and say I circled back to this issue and have resolved it; my InstanceGroup
nodes are running on EKS with warm-pools that are able to scale out from 0 and back to 0 again.
The total fixes ended up being:
- adding autoscaler template tags similar to k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage for the ephemeral storage, memory, and CPU of my nodes
- specifying the correct subnets and security groups for my InstanceGroup. I had originally created a custom security group for my InstanceGroup with the wrong inbound and outbound rules. This kept the control plane and my nodes from being able to communicate with each other. I ended up pulling my security groups and subnets directly off of the EKS cluster and that worked just fine.
Thanks again for all the help @backjo and @eytan-avisror !
My working InstanceGroup
looks like this:
apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  annotations:
    instancemgr.keikoproj.io/cluster-autoscaler-enabled: 'true'
  name: my-group
  namespace: group-namespace
spec:
  strategy:
    type: rollingUpdate
    rollingUpdate:
      maxUnavailable: 50%
  provisioner: eks
  eks:
    configuration:
      clusterName: my-eks-cluster
      image: ami-12346789
      instanceType: t3.large
      keyPairName: my-keypair
      subnets:
      - subnet-abc123
      - subnet-def456
      securityGroups:
      - sg-abc123467
      tags:
      - key: k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage
        value: "20Gi"
      - key: k8s.io/cluster-autoscaler/node-template/resources/cpu
        value: "2"
      - key: k8s.io/cluster-autoscaler/node-template/resources/memory
        value: "8Gi"
      volumes:
      - name: /dev/xvda
        type: gp2
        size: 32
    maxSize: 60
    minSize: 0
    warmPool:
      maxSize: 60
      minSize: 15