keikoproj/instance-manager

Cluster Autoscaler unable to scale up nodes from EKS autoscaling group warm pool when pods request ephemeral storage

David-Tamrazov opened this issue · 26 comments

Is this a BUG REPORT or FEATURE REQUEST?:

A bug report.

What happened:

Cluster Autoscaler is unable to scale up nodes from warm pools of an InstanceGroup with the eks provisioner when the unscheduled pods request additional ephemeral storage:

I0215 16:21:32.873817       1 klogx.go:86] Pod nodegroup-runners/iris-kdr5j-mhntj is unschedulable
I0215 16:21:32.873855       1 scale_up.go:376] Upcoming 1 nodes
I0215 16:21:32.873910       1 scale_up.go:300] Pod iris-kdr5j-mhntj can't be scheduled on eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0215 16:21:32.873939       1 scale_up.go:449] No pod can fit to eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7
I0215 16:21:32.873989       1 scale_up.go:300] Pod iris-kdr5j-mhntj can't be scheduled on tm-github-runners-cluster-nodegroup-runners-iris, predicate checking error: Insufficient ephemeral-storage; predicateName=NodeResourcesFit; reasons: Insufficient ephemeral-storage; debugInfo=
I0215 16:21:32.874023       1 scale_up.go:449] No pod can fit to tm-github-runners-cluster-nodegroup-runners-iris
I0215 16:21:32.874050       1 scale_up.go:453] No expansion options

This is the configured InstanceGroup in question:

apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  name: iris
  namespace: nodegroup-runners
  annotations:
    instancemgr.keikoproj.io/cluster-autoscaler-enabled: 'true'
spec:
  strategy:
    type: rollingUpdate
    rollingUpdate:
      maxUnavailable: 5
  provisioner: eks
  eks:
    minSize: 1
    maxSize: 10
    warmPool:
      minSize: 0
      maxSize: 10
    configuration:
      labels:
        workload: runners
      keyPairName: iris-github-runner-keypair
      clusterName: tm-github-runners-cluster
      image: ami-0778893a848813e52
      instanceType: c6i.2xlarge

I'm attempting to deploy pods via this RunnerDeployment from the actions-runner-controller. If I remove the ephemeral-storage requests & limits, then Cluster Autoscaler is able to scale up nodes from the warm pool as expected.

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: iris
  namespace: nodegroup-runners
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload
                operator: In
                values:
                - runners
      ephemeral: true
      repository: MyOrg/iris
      labels:
        - self-hosted
      dockerEnabled: false
      image: ghcr.io/my-org/self-hosted-runners/iris:v7
      imagePullSecrets:
        - name: github-container-registry
      containers:
        - name: runner
          imagePullPolicy: IfNotPresent
          env:
            - name: RUNNER_FEATURE_FLAG_EPHEMERAL
              value: "true"
          resources:
            requests:
              cpu: "1.0"
              memory: "2Gi"
              ephemeral-storage: "10Gi"
            limits:
              cpu: "2.0"
              memory: "4Gi"
              ephemeral-storage: "10Gi"

What you expected to happen:

For Cluster Autoscaler to be able to scale up nodes from the warm pool when there are unschedulable pods that request ephemeral storage.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy the above instance group (using your own subnets, cluster, and security groups) into a cluster with Cluster Autoscaler in it.

  2. Deploy any pods with the appropriate node affinity and requests/limits for ephemeral storage.

  3. Check the Cluster Autoscaler logs and note the failure to scale up (see the sketch below).
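
A sketch for step 3, assuming Cluster Autoscaler runs as a Deployment named cluster-autoscaler in the kube-system namespace (adjust to your install):

# tail the autoscaler logs and filter for scale-up decisions
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=200 | grep -E 'scale_up|unschedulable'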

Environment:

  • Kubernetes version: 1.21
$ kubectl version -o yaml
clientVersion:
  buildDate: "2021-08-19T15:45:37Z"
  compiler: gc
  gitCommit: 632ed300f2c34f6d6d15ca4cef3d3c7073412212
  gitTreeState: clean
  gitVersion: v1.22.1
  goVersion: go1.16.7
  major: "1"
  minor: "22"
  platform: darwin/amd64
serverVersion:
  buildDate: "2021-10-29T23:32:16Z"
  compiler: gc
  gitCommit: 5236faf39f1b7a7dabea8df12726f25608131aa9
  gitTreeState: clean
  gitVersion: v1.21.5-eks-bc4871b
  goVersion: go1.16.8
  major: "1"
  minor: 21+
  platform: linux/amd64

Other debugging information (if applicable):

None available, as we've deprovisioned our test setup for now, but if need be we can reproduce and post additional logs here.

Thanks for filing this, although this seems to be a cluster-autoscaler issue. Does it work without warm pools, e.g. generic autoscaling based on ephemeral storage?
I'm not sure there is something we can do on the instance-manager side to support this.

I think this is specifically a 'scale from zero' problem - where CA uses only the tags on the ASG to determine the capacity of a potential new node. I believe this is possible to implement in instance-manager - we'd have to inspect the volumes attached to the IG, determine which volume is used for ephemeral storage, and tag the ASG accordingly.
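
For reference, the scale-from-zero hint Cluster Autoscaler reads is a tag on the ASG itself. Applied by hand it would look roughly like this (a sketch; the ASG name and the 20Gi value are placeholders, and this is what an instance-manager implementation would effectively set):

# tag the ASG so CA can estimate ephemeral storage for a not-yet-existing node
aws autoscaling create-or-update-tags --tags \
  "ResourceId=<your-asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage,Value=20Gi,PropagateAtLaunch=false"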

@backjo definitely could be..

@David-Tamrazov could you confirm if this is the case? If minSize is not 0, does it work?

Let me give it a spin! I did indeed run into that issue before, hence the minSize: 1 setting on the instance group itself, but it didn't occur to me that minSize: 0 on the warm pool might be a problem.

Re: this being a cluster autoscaler problem, it fully could be. I only figured I'd post here since I was able to get Cluster Autoscaler to pull in nodes from an EKS managed nodegroup provisioned through CDK for the same pods without issue, so I figured there might be something missing in the InstanceGroup setup.

I'll try the following and report back:

  • autoscaling for pods that need ephemeral storage without warm pools
  • autoscaling for pods that need ephemeral storage with warm pools and minSize set to 1 on the warm pool

It would also be good to experiment with tagging the ASG via configuration.tags with k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: "xGi" (replace x with the volume size on your node).
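
For example, in the InstanceGroup spec that could look roughly like this (a sketch; the 20Gi value is illustrative, and the same tag format shows up in the working spec at the end of this thread):

spec:
  eks:
    configuration:
      tags:
      # node-template hint that Cluster Autoscaler reads for scale-from-zero
      - key: k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage
        value: "20Gi"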

Adding the tag definitely got the autoscaler to register the autoscaling group as a suitable candidate, great tip:

I0215 22:23:04.427911       1 static_autoscaler.go:319] 2 unregistered nodes present
I0215 22:23:04.428042       1 filter_out_schedulable.go:65] Filtering out schedulables
I0215 22:23:04.428076       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0215 22:23:04.428123       1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-4v89d-registration-only marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-8439533899060027234-upcoming-0. Ignoring in scale up.
I0215 22:23:04.428168       1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-4v89d-4gncr marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-8439533899060027234-upcoming-1. Ignoring in scale up.
I0215 22:23:04.428185       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0215 22:23:04.428193       1 filter_out_schedulable.go:171] 2 pods marked as unschedulable can be scheduled.
I0215 22:23:04.428204       1 filter_out_schedulable.go:79] Schedulable pods present
I0215 22:23:04.428228       1 static_autoscaler.go:401] No unschedulable pods

However, the pod just sits there unscheduled now; I think something is preventing the kube-scheduler from placing the Pod on the node:

$ kubectl describe pod iris-4v89d-registration-only -n nodegroup-runners                     
Name:           iris-4v89d-registration-only
Namespace:      nodegroup-runners
Priority:       0
Node:           <none>
Labels:         pod-template-hash=749fd4569f
Annotations:    actions-runner-controller/registration-only: true
                kubernetes.io/psp: eks.privileged
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  Runner/iris-4v89d-registration-only
Containers:
  runner:
    Image:      ghcr.io/myorg/self-hosted-runners/iris:v7
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                2
      ephemeral-storage:  10Gi
      memory:             4Gi
    Requests:
      cpu:                1
      ephemeral-storage:  10Gi
      memory:             2Gi
    Environment:
      RUNNER_FEATURE_FLAG_EPHEMERAL:  true
      RUNNER_ORG:                     
      RUNNER_REPO:                    MyOrg/iris
      RUNNER_ENTERPRISE:              
      RUNNER_LABELS:                  self-hosted
      RUNNER_GROUP:                   
      DOCKERD_IN_RUNNER:              false
      GITHUB_URL:                     https://github.com/
      RUNNER_WORKDIR:                 /runner/_work
      RUNNER_EPHEMERAL:               true
      RUNNER_REGISTRATION_ONLY:       true
      RUNNER_NAME:                    iris-4v89d-registration-only
      RUNNER_TOKEN:                   <--REDACTED-->
    Mounts:
      /runner from runner (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bz7lp (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  runner:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-bz7lp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  52s (x6 over 4m56s)  default-scheduler  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.

If I run kubectl get nodes I see that the nodes from the autoscaling group don't appear in the list, so I'm not sure if the scheduler is even aware that these nodes exist. I even see "2 unregistered nodes present" in the autoscaler logs, which could be hinting at the same issue:

[screenshots: EC2 console showing the instances launched from the autoscaling group]

$ kubectl get nodes                                                                          
NAME                          STATUS   ROLES    AGE    VERSION
ip-10-1-19-186.ec2.internal   Ready    <none>   4d2h   v1.21.5-eks-9017834
ip-10-1-4-225.ec2.internal    Ready    <none>   4d2h   v1.21.5-eks-9017834

^ It might be hard to see, but you can tell the nodes listed by kubectl get nodes aren't the same ones from the autoscaling group by their names and age (the other 2 nodes are 4 days old; these new ones were just created).
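
One way to cross-check is to compare the ASG's instance DNS names against the registered node names (a sketch; the ASG name is the one from this thread and the AWS CLI must be configured):

# private DNS names of the instances belonging to the instance-group ASG
aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=tm-github-runners-cluster-nodegroup-runners-iris" \
  --query 'Reservations[].Instances[].PrivateDnsName'
kubectl get nodes -o name   # EKS node names are normally the instances' private DNS names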

Nothing from instance-manager logs jumps out at me as problematic:

2022-02-15T22:31:08.609Z        INFO    v1alpha1        state transition occured        {"instancegroup": "nodegroup-runners/iris", "state": "ReconcileModifying", "previousState": "InitUpdate"}
2022-02-15T22:31:08.609Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "iam", "operation": "GetRole"}
2022-02-15T22:31:08.609Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "iam", "operation": "GetInstanceProfile"}
2022-02-15T22:31:08.610Z        INFO    controllers.instancegroup.eks   updated managed policies        {"instancegroup": "nodegroup-runners/iris", "iamrole": "tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:08.610Z        INFO    controllers.instancegroup.eks   reconciled managed role {"instancegroup": "nodegroup-runners/iris", "iamrole": "tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:08.610Z        INFO    scaling drift not detected      {"instancegroup": "nodegroup-runners/iris"}
2022-02-15T22:31:08.610Z        INFO    controllers.instancegroup.eks   bootstrapping arn to aws-auth   {"instancegroup": "nodegroup-runners/iris", "arn": "arn:aws:iam::<--redacted-->:role/tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:08.628Z        INFO    controllers.instancegroup.eks   waiting for node readiness conditions   {"instancegroup": "nodegroup-runners/iris"}
2022-02-15T22:31:08.629Z        INFO    controllers.instancegroup.eks   desired nodes are not ready     {"instancegroup": "nodegroup-runners/iris", "instances": "i-0616a2e0e0d595ddc,i-09d55075b02cbc6b6"}
2022-02-15T22:31:08.629Z        INFO    controllers.instancegroup       reconcile event ended with requeue      {"instancegroup": "nodegroup-runners/iris", "provisioner": "eks"}
2022-02-15T22:31:08.630Z        INFO    controllers.instancegroup       patching resource status        {"instancegroup": "nodegroup-runners/iris", "patch": "{}", "resourceVersion": "5028928"}
2022-02-15T22:31:18.641Z        INFO    controllers.instancegroup       reconcile event started {"instancegroup": "nodegroup-runners/iris", "provisioner": "eks"}
2022-02-15T22:31:18.642Z        INFO    v1alpha1        state transition occured        {"instancegroup": "nodegroup-runners/iris", "state": "Init", "previousState": "ReconcileModifying"}
2022-02-15T22:31:18.642Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "autoscaling", "operation": "DescribeLaunchConfigurations"}
2022-02-15T22:31:18.731Z        DEBUG   aws-provider    AWS API call    {"cacheHit": false, "service": "iam", "operation": "GetRole"}
2022-02-15T22:31:18.771Z        DEBUG   aws-provider    AWS API call    {"cacheHit": false, "service": "iam", "operation": "ListAttachedRolePolicies"}
2022-02-15T22:31:18.816Z        DEBUG   aws-provider    AWS API call    {"cacheHit": false, "service": "iam", "operation": "GetInstanceProfile"}
2022-02-15T22:31:18.818Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "autoscaling", "operation": "DescribeAutoScalingGroups"}
2022-02-15T22:31:18.819Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "eks", "operation": "DescribeCluster"}
2022-02-15T22:31:18.870Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:18.926Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:18.979Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:19.062Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:19.115Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "ec2", "operation": "DescribeInstanceTypes"}
2022-02-15T22:31:19.122Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "autoscaling", "operation": "DescribeLifecycleHooks"}
2022-02-15T22:31:19.123Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "autoscaling", "operation": "DescribeLaunchConfigurations"}
2022-02-15T22:31:19.123Z        INFO    v1alpha1        state transition occured        {"instancegroup": "nodegroup-runners/iris", "state": "InitUpdate", "previousState": "Init"}
2022-02-15T22:31:19.123Z        INFO    v1alpha1        state transition occured        {"instancegroup": "nodegroup-runners/iris", "state": "ReconcileModifying", "previousState": "InitUpdate"}
2022-02-15T22:31:19.123Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "iam", "operation": "GetRole"}
2022-02-15T22:31:19.124Z        DEBUG   aws-provider    AWS API call    {"cacheHit": true, "service": "iam", "operation": "GetInstanceProfile"}
2022-02-15T22:31:19.124Z        INFO    controllers.instancegroup.eks   updated managed policies        {"instancegroup": "nodegroup-runners/iris", "iamrole": "tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:19.124Z        INFO    controllers.instancegroup.eks   reconciled managed role {"instancegroup": "nodegroup-runners/iris", "iamrole": "tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:19.124Z        INFO    scaling drift not detected      {"instancegroup": "nodegroup-runners/iris"}
2022-02-15T22:31:19.124Z        INFO    controllers.instancegroup.eks   bootstrapping arn to aws-auth   {"instancegroup": "nodegroup-runners/iris", "arn": "arn:aws:iam::<--redacted-->:role/tm-github-runners-cluster-nodegroup-runners-iris"}
2022-02-15T22:31:19.143Z        INFO    controllers.instancegroup.eks   waiting for node readiness conditions   {"instancegroup": "nodegroup-runners/iris"}
2022-02-15T22:31:19.143Z        INFO    controllers.instancegroup.eks   desired nodes are not ready     {"instancegroup": "nodegroup-runners/iris", "instances": "i-0616a2e0e0d595ddc,i-09d55075b02cbc6b6"}
2022-02-15T22:31:19.143Z        INFO    controllers.instancegroup       reconcile event ended with requeue      {"instancegroup": "nodegroup-runners/iris", "provisioner": "eks"}
2022-02-15T22:31:19.143Z        INFO    controllers.instancegroup       patching resource status        {"instancegroup": "nodegroup-runners/iris", "patch": "{}", "resourceVersion": "5028928"}

Great, if nodes scale out now, we have an approach for adding support for autoscaler as part of the cluster-autoscaler-enabled annotation. (thanks @backjo 😀)

If you now have a scheduling issue, can you compare the affinity to make sure it's matching, e.g. the pod's node affinity vs. the node labels?

The pod describe you added shows Node-Selectors: <none>, so maybe something is not going through there.
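
A quick way to compare the two, as a sketch using the pod, namespace, and label names from this thread:

# print the pod's node affinity as it actually landed on the API server
kubectl get pod iris-4v89d-registration-only -n nodegroup-runners \
  -o jsonpath='{.spec.affinity.nodeAffinity}{"\n"}'
kubectl get nodes -L workload   # adds a WORKLOAD column showing each node's label value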

I think that's because I'm not explicitly defining a nodeSelector on the pod; in other contexts (i.e. with managed nodegroups via CDK), the affinity works as expected between my nodes and the RunnerDeployment described above. I believe it matches here as well, since the autoscaler sees it doesn't match other nodes in my cluster, but it does consider the pod affinity to match the AutoScalingGroup from the InstanceGroup.

No match with other nodes in the cluster:

I0215 22:03:53.173980       1 klogx.go:86] Pod nodegroup-runners/iris-4v89d-4gncr is unschedulable
I0215 22:03:53.174135       1 scale_up.go:376] Upcoming 1 nodes
I0215 22:03:53.174197       1 scale_up.go:300] Pod iris-4v89d-4gncr can't be scheduled on eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0215 22:03:53.174227       1 scale_up.go:449] No pod can fit to eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7

Successful match for the template nodes from the autoscaling group warmpool:

I0215 22:23:04.427911       1 static_autoscaler.go:319] 2 unregistered nodes present
I0215 22:23:04.428042       1 filter_out_schedulable.go:65] Filtering out schedulables
I0215 22:23:04.428076       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0215 22:23:04.428123       1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-4v89d-registration-only marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-8439533899060027234-upcoming-0. Ignoring in scale up.
I0215 22:23:04.428168       1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-4v89d-4gncr marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-8439533899060027234-upcoming-1. Ignoring in scale up.

I can also see the necessary tags on the ASG:

[screenshot: the node-template tags on the ASG]

I can't really check the nodes because they're not accessible via the Kubernetes API; it seems to be unaware of them and I don't think they're registered with the cluster.

Can you confirm you see the affinity on the pod spec for the pods created by the RunnerDeployment?
I think the ASG/Cluster Autoscaler part is fine: nodes come up when pods are pending, but for some reason the pods cannot be scheduled when the node comes up. So in that case either the label is not on the node (which it seems to be) or the affinity is not on the pod?

Yep, I can confirm the affinity is on the pod spec. I'm just thinking: how come there wouldn't be logs for the missing node label if the scheduler found the autoscaling nodes and saw there were no labels present? The current 0/2 nodes are available refers to a managed nodegroup I deployed with the cluster to run system controllers, including instance-manager. If the scheduler saw the ASG nodes and found the label missing, then the log ought to read 0/4 nodes are available:

$ kubectl get pod iris-4v89d-registration-only -oyaml -n nodegroup-runners
apiVersion: v1
kind: Pod
metadata:
  annotations:
    actions-runner-controller/registration-only: "true"
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2022-02-15T22:51:34Z"
  labels:
    pod-template-hash: 749fd4569f
  name: iris-4v89d-registration-only
  namespace: nodegroup-runners
  ownerReferences:
  - apiVersion: actions.summerwind.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Runner
    name: iris-4v89d-registration-only
    uid: 7a0c2b6e-51d6-42cd-8dd4-47d972440550
  resourceVersion: "5041103"
  uid: 67e81669-576f-4f09-b09a-326824750ae9
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload
            operator: In
            values:
            - runners
  containers:
  - env:
    - name: RUNNER_FEATURE_FLAG_EPHEMERAL
      value: "true"
    - name: RUNNER_ORG
    - name: RUNNER_REPO
      value: MyOrg/iris
    - name: RUNNER_ENTERPRISE
    - name: RUNNER_LABELS
      value: self-hosted
    - name: RUNNER_GROUP
    - name: DOCKERD_IN_RUNNER
      value: "false"
    - name: GITHUB_URL
      value: https://github.com/
    - name: RUNNER_WORKDIR
      value: /runner/_work
    - name: RUNNER_EPHEMERAL
      value: "true"
    - name: RUNNER_REGISTRATION_ONLY
      value: "true"
    - name: RUNNER_NAME
      value: iris-4v89d-registration-only
    - name: RUNNER_TOKEN
      value: <--Redacted-->
    image: ghcr.io/myorg/self-hosted-runners/iris:v7
    imagePullPolicy: IfNotPresent
    name: runner
    resources:
      limits:
        cpu: "2"
        ephemeral-storage: 10Gi
        memory: 4Gi
      requests:
        cpu: "1"
        ephemeral-storage: 10Gi
        memory: 2Gi
    securityContext:
      privileged: false
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /runner
      name: runner
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-f8gnv
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: github-container-registry
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: runner
  - name: kube-api-access-f8gnv
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-02-15T22:51:34Z"
    message: '0/2 nodes are available: 2 node(s) didn''t match Pod''s node affinity/selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

One possibility is that this is still a Cluster Autoscaler issue and it's not bringing in nodes from warm pools correctly; it spins them up but then doesn't register them with the cluster for some reason. Could be wrong though.

The warm pool instances are not supposed to be part of the cluster; they are spun up and shut down while in the warm pool. Later, when autoscaling happens, instead of spinning up new instances, instances from the warm pool simply move to the 'live' pool of instances and are powered on. The 'faster' scaling here is because we only spend time waiting for the instance to boot instead of boot + provisioning.

As long as you see the instances moving from the warm pool to the main pool when scaling happens, this part should be fine. You can also test the same thing without warm pools to exclude it.
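
Something like the following can confirm that movement (a sketch; the ASG name is the one from this thread and assumes the AWS CLI is configured):

# instances currently parked in the warm pool
aws autoscaling describe-warm-pool \
  --auto-scaling-group-name tm-github-runners-cluster-nodegroup-runners-iris
# instances in the live pool, with their lifecycle state
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names tm-github-runners-cluster-nodegroup-runners-iris \
  --query 'AutoScalingGroups[0].Instances[].[InstanceId,LifecycleState]'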

Went over the pictures you added again - are you saying the instances that are spun up in the live pool are not joining the cluster? The instances in the live pool should definitely join the cluster, so there could be another issue here

Went over the pictures you added again - are you saying the instances that are spun up in the live pool are not joining the cluster? The instances in the live pool should definitely join the cluster, so there could be another issue here

Yeah, none of the instances from the InstanceGroup are part of the cluster or appear when I run kubectl get nodes, nor is the one live instance I spin up by default with minSize: 1 in the instance group configuration (not the warm pool configuration) able to have a pod scheduled onto it.

I'm currently redeploying without a warm pool to rule out the warm pool itself.

So I removed the warm pool and only deployed the InstanceGroup; the live node is still not listed when I run kubectl get nodes:

$ kubectl get nodes
NAME                          STATUS   ROLES    AGE    VERSION
ip-10-1-19-186.ec2.internal   Ready    <none>   4d3h   v1.21.5-eks-9017834
ip-10-1-4-225.ec2.internal    Ready    <none>   4d3h   v1.21.5-eks-9017834

I see the instance in the ASG running and healthy. Screenshot from the ASG console:

[screenshot of the ASG console]

EKS console screenshot:

[screenshot of the EKS console]

InstanceGroup spec:

$ kubectl describe instancegroup iris -n nodegroup-runners
Name:         iris
Namespace:    nodegroup-runners
Labels:       <none>
Annotations:  instancemgr.keikoproj.io/cluster-autoscaler-enabled: true
API Version:  instancemgr.keikoproj.io/v1alpha1
Kind:         InstanceGroup
Metadata:
  Creation Timestamp:  2022-02-15T23:27:09Z
  Finalizers:
    finalizer.instancegroups.keikoproj.io
  Generation:  1
  Managed Fields:
    API Version:  instancemgr.keikoproj.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:instancemgr.keikoproj.io/cluster-autoscaler-enabled:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:eks:
          .:
          f:configuration:
            .:
            f:clusterName:
            f:image:
            f:instanceType:
            f:keyPairName:
            f:labels:
              .:
              f:workload:
            f:securityGroups:
            f:subnets:
            f:tags:
          f:maxSize:
          f:minSize:
        f:provisioner:
        f:strategy:
          .:
          f:rollingUpdate:
            .:
            f:maxUnavailable:
          f:type:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-02-15T23:27:09Z
    API Version:  instancemgr.keikoproj.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"finalizer.instancegroups.keikoproj.io":
      f:status:
        .:
        f:activeLaunchConfigurationName:
        f:activeScalingGroupName:
        f:conditions:
        f:currentMax:
        f:currentMin:
        f:currentState:
        f:lifecycle:
        f:nodesInstanceRoleArn:
        f:provisioner:
        f:strategy:
    Manager:         manager
    Operation:       Update
    Time:            2022-02-15T23:28:49Z
  Resource Version:  5053640
  UID:               b3b8c7ac-004e-4ffb-ad5e-6b2b94a8e183
Spec:
  Eks:
    Configuration:
      Cluster Name:   tm-github-runners-cluster
      Image:          ami-0778893a848813e52
      Instance Type:  c6i.2xlarge
      Key Pair Name:  iris-github-runner-keypair
      Labels:
        Workload:  runners
      Security Groups:
        sg-0e8839d84fa962a0e
      Subnets:
        subnet-fad141a3
        subnet-01da7c2d4ac7c3da7
        subnet-fdd141a4
        subnet-03400f942826b18bb
      Tags:
        Key:    k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage
        Value:  10Gi
    Max Size:   10
    Min Size:   1
  Provisioner:  eks
  Strategy:
    Rolling Update:
      Max Unavailable:  5
    Type:               rollingUpdate
Status:
  Active Launch Configuration Name:  tm-github-runners-cluster-nodegroup-runners-iris-20220215232745
  Active Scaling Group Name:         tm-github-runners-cluster-nodegroup-runners-iris
  Conditions:
    Status:                 False
    Type:                   NodesReady
  Current Max:              10
  Current Min:              1
  Current State:            ReconcileModifying
  Lifecycle:                normal
  Nodes Instance Role Arn:  arn:aws:iam::<--redacted-->:role/tm-github-runners-cluster-nodegroup-runners-iris
  Provisioner:              eks
  Strategy:                 rollingUpdate
Events:
  Type    Reason                Age    From  Message
  ----    ------                ----   ----  -------
  Normal  InstanceGroupCreated  5m17s        {"instancegroup":"nodegroup-runners/iris","msg":"instance group has been successfully created","scalinggroup":"tm-github-runners-cluster-nodegroup-runners-iris"}

Autoscaler logs - I'm seeing "1 unregistered nodes present" now that there's no warm pool instance scaled up. I'm fairly convinced it's talking about the 1 live InstanceGroup instance we have now - this same log read "2 unregistered nodes present" with the 1 live instance + 1 warm pool instance before:

I0215 23:37:42.277292       1 static_autoscaler.go:228] Starting main loop
I0215 23:37:42.373011       1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: [eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7 tm-github-runners-cluster-nodegroup-runners-iris]
I0215 23:37:42.425687       1 auto_scaling.go:199] 2 launch configurations already in cache
I0215 23:37:42.425853       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2022-02-15 23:38:42.42584869 +0000 UTC m=+89744.527237746
I0215 23:37:42.426168       1 aws_manager.go:315] Found multiple availability zones for ASG "tm-github-runners-cluster-nodegroup-runners-iris"; using us-east-1a for failure-domain.beta.kubernetes.io/zone label
I0215 23:37:42.426890       1 static_autoscaler.go:319] 1 unregistered nodes present
I0215 23:37:42.426971       1 filter_out_schedulable.go:65] Filtering out schedulables
I0215 23:37:42.426982       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0215 23:37:42.427015       1 filter_out_schedulable.go:157] Pod nodegroup-runners.iris-ksvwg-registration-only marked as unschedulable can be scheduled on node template-node-for-tm-github-runners-cluster-nodegroup-runners-iris-3295993556649179818-upcoming-0. Ignoring in scale up.
I0215 23:37:42.427023       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0215 23:37:42.427051       1 filter_out_schedulable.go:171] 1 pods marked as unschedulable can be scheduled.
I0215 23:37:42.427058       1 filter_out_schedulable.go:79] Schedulable pods present
I0215 23:37:42.427077       1 static_autoscaler.go:401] No unschedulable pods
I0215 23:37:42.427091       1 static_autoscaler.go:448] Calculating unneeded nodes
I0215 23:37:42.427104       1 pre_filtering_processor.go:66] Skipping ip-10-1-4-225.ec2.internal - node group min size reached
I0215 23:37:42.427111       1 pre_filtering_processor.go:66] Skipping ip-10-1-19-186.ec2.internal - node group min size reached
I0215 23:37:42.427151       1 static_autoscaler.go:502] Scale down status: unneededOnly=true lastScaleUpTime=2022-02-15 23:23:16.738060926 +0000 UTC m=+88818.839449992 lastScaleDownDeleteTime=2022-02-15 20:51:39.433862308 +0000 UTC m=+79721.535251364 lastScaleDownFailTime=2022-02-14 22:43:18.998809906 +0000 UTC m=+21.100198962 scaleDownForbidden=true isDeleteInProgress=false scaleDownInCooldown=true

The autoscaler also shouldn't be trying to scale up new nodes with just the 1 pod needing scheduling; that 1 pod should fit onto the 1 c6i.2xlarge instance (8 vCPUs, 16 GiB memory) that is provisioned by the InstanceGroup:

[screenshot]

Pod desc:

$ kubectl describe pod iris-ksvwg-registration-only -n nodegroup-runners
Name:           iris-ksvwg-registration-only
Namespace:      nodegroup-runners
Priority:       0
Node:           <none>
Labels:         pod-template-hash=749fd4569f
Annotations:    actions-runner-controller/registration-only: true
                kubernetes.io/psp: eks.privileged
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  Runner/iris-ksvwg-registration-only
Containers:
  runner:
    Image:      ghcr.io/myorg/self-hosted-runners/iris:v7
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                2
      ephemeral-storage:  10Gi
      memory:             4Gi
    Requests:
      cpu:                1
      ephemeral-storage:  10Gi
      memory:             2Gi
    Environment:
      RUNNER_FEATURE_FLAG_EPHEMERAL:  true
      RUNNER_ORG:                     
      RUNNER_REPO:                    MyOrg/iris
      RUNNER_ENTERPRISE:              
      RUNNER_LABELS:                  self-hosted
      RUNNER_GROUP:                   
      DOCKERD_IN_RUNNER:              false
      GITHUB_URL:                     https://github.com/
      RUNNER_WORKDIR:                 /runner/_work
      RUNNER_EPHEMERAL:               true
      RUNNER_REGISTRATION_ONLY:       true
      RUNNER_NAME:                    iris-ksvwg-registration-only
      RUNNER_TOKEN:                   <--REDACTED-->
    Mounts:
      /runner from runner (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6l7rd (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  runner:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-6l7rd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  20s (x5 over 4m3s)  default-scheduler  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.

In this case you need to understand why the nodes are not joining the cluster; the instance-manager controller was also showing this in the logs:

controllers.instancegroup.eks   desired nodes are not ready     {"instancegroup": "nodegroup-runners/iris", "instances": "i-0616a2e0e0d595ddc,i-09d55075b02cbc6b6"}

I would look at the nodes' kubelet logs & the control plane logs to see why they are not joining (you can probably find the control plane logs in CloudWatch Logs, and get the kubelet logs by SSHing to the nodes).
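
As a sketch of where to look (assumes SSH access to the instances and a standard EKS aws-auth setup):

# on the instance, over SSH: recent kubelet logs usually show registration/join errors
journalctl -u kubelet --no-pager | tail -n 200
# from your workstation: confirm the node role ARN is mapped so nodes can register
kubectl -n kube-system get configmap aws-auth -o yaml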

Great, sounds good, thanks for the pointer. I'll dig through CloudWatch and let you know what I find.

@David-Tamrazov any success with this?

@backjo Do we want to implement some fix for this?
I think this would be slightly problematic considering you don't know which volume the application will use, e.g. someone could be attaching EBS volumes or using the instance store; the storage size would be dynamic from the perspective of the controller.
The workaround of adding a tag to the IG spec is actually pretty good:

k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: ?

We could have an annotation to add this tag, but that might be a bit redundant.

Any thoughts?

@eytan-avisror unfortunately not yet, I got pulled aside on a different issue at work and haven't been able to dedicate more time to this. I'm hoping to set up CloudWatch logs for the node kubelets this week to get an idea of what might be happening. My suspicion is that I'm incorrectly setting up the security group for the instances, so the nodes can't actually communicate with the API server.

@backjo Do we want to implement some fix for this? I think this would be slightly problematic considering you don't know which volume the application will use, e.g. someone could be attaching EBS volumes or using the instance store; the storage size would be dynamic from the perspective of the controller. The workaround of adding a tag to the IG spec is actually pretty good:

k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: ?

We could have an annotation to add this tag, but that might be a bit redundant.

Any thoughts?

I'm not sure that we'll be able to be correct enough to implement this well. As you mention, we generally will not know exactly which volume the Node has configured for ephemeral storage. I think the workaround here is suitable for folks. Maybe we could update the documentation around the cluster autoscaler annotation, though, to help awareness of the limitation here.

I wanted to post back and say I circled back to this issue and have resolved it; my InstanceGroup nodes are running on EKS with warm pools that are able to scale out from 0 and back to 0 again.

The total fixes ended up being:

  • adding autoscaler template tags similar to k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: ? for ephemeral storage, memory, and CPU of my nodes
  • specifying the correct subnets and security groups for my InstanceGroup. I had originally created a custom security group for my InstanceGroup with the wrong inbound and outbound rules, which kept the control plane and my nodes from being able to communicate with each other. I ended up pulling my security groups and subnets directly off of the EKS cluster (see the sketch below) and that worked just fine.
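
A sketch of how the subnets and security groups can be read directly off the EKS cluster (the cluster name is the one from this thread; requires the AWS CLI):

# subnets and security groups the control plane was created with
aws eks describe-cluster --name tm-github-runners-cluster \
  --query 'cluster.resourcesVpcConfig.{subnets:subnetIds,clusterSecurityGroup:clusterSecurityGroupId,additionalSecurityGroups:securityGroupIds}'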

Thanks again for all the help, @backjo and @eytan-avisror!

My working InstanceGroup looks like this:

apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  annotations:
    instancemgr.keikoproj.io/cluster-autoscaler-enabled: 'true'
  name: my-group
  namespace: group-namespace
spec:
  strategy:
    type: rollingUpdate
    rollingUpdate:
      maxUnavailable: 50%
  provisioner: eks
  eks:
    configuration:
      clusterName: my-eks-cluster
      image: ami-12346789
      instanceType: t3.large
      keyPairName: my-keypair
      subnets:
      - subnet-abc123
      - subnet-def456
      securityGroups:
      - sg-abc123467
      tags:
      - key: k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage
        value: "20Gi"
      - key: k8s.io/cluster-autoscaler/node-template/resources/cpu
        value: "2"
      - key: k8s.io/cluster-autoscaler/node-template/resources/memory
        value: "8Gi"
      volumes:
      - name: /dev/xvda
        type: gp2
        size: 32
    maxSize: 60
    minSize: 0
    warmPool:
      maxSize: 60
      minSize: 15