Explore concept of VNA - Vertical Node Autoscaler

Question

eytan-avisror opened this issue 2 years ago · 3 comments

In some cases it may be appropriate to scale nodes vertically, e.g. from m5.xlarge to m5.2xlarge.
For example, when we detect that better bin packing is possible, or when the IG reaches its max size and there are still pending pods.

We could try to abstract the instance type away completely, for example:

apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  name: my-instance-group
  namespace: instance-manager
spec:
  provisioner: eks
  strategy:
    type: rollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  eks:
    minSize: 3
    maxSize: 6
    configuration:

      # < instanceType not provided >

      instanceFamily: m5  # optional

      resources:
        requests:
          mem: 8Gi
          cpu: 2
        limits:
          mem: 64Gi
          cpu: 16
      ...

Initially spin up m5.large (if instanceFamily is provided; otherwise we can pick the best match), which provides 2 vCPU / 8 GiB of memory, and we can scale up to m5.4xlarge, which has 16 vCPU / 64 GiB.
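A minimal sketch of how the controller might resolve requests/limits into a range of instance types. The function names and the hard-coded table are hypothetical and not part of instance-manager; the vCPU and memory figures are the published m5 sizes.

```python
from typing import List, NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    vcpu: int
    mem_gib: int


# Part of the m5 family, ordered smallest to largest (hard-coded for illustration).
M5_FAMILY = [
    InstanceType("m5.large", 2, 8),
    InstanceType("m5.xlarge", 4, 16),
    InstanceType("m5.2xlarge", 8, 32),
    InstanceType("m5.4xlarge", 16, 64),
    InstanceType("m5.8xlarge", 32, 128),
]


def vertical_range(family: List[InstanceType], req_cpu: int, req_mem: int,
                   lim_cpu: int, lim_mem: int) -> List[InstanceType]:
    """Instance types that satisfy the requests without exceeding the limits."""
    return [
        t for t in family
        if req_cpu <= t.vcpu <= lim_cpu and req_mem <= t.mem_gib <= lim_mem
    ]


def next_size_up(candidates: List[InstanceType], current: str) -> Optional[InstanceType]:
    """Next larger type within the allowed range, or None if already at the limit."""
    names = [t.name for t in candidates]
    idx = names.index(current)
    return candidates[idx + 1] if idx + 1 < len(candidates) else None


# requests: cpu 2 / mem 8Gi, limits: cpu 16 / mem 64Gi
candidates = vertical_range(M5_FAMILY, 2, 8, 16, 64)
print([t.name for t in candidates])
print(next_size_up(candidates, "m5.xlarge"))
```

The first entry of the range would be the initial instance type, and the last entry is where vertical scaling stops.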

Another option is to keep this new spec inside a VerticalScalingPolicy, so that the IG simply omits instanceType and a VSP is provided as follows:

apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: VerticalScalingPolicy
metadata:
  name: default
  namespace: instance-manager
spec:

  instanceFamily: m5  # optional

  resources:
    requests:
      mem: 8Gi
      cpu: 2
    limits:
      mem: 64Gi
      cpu: 16

  scaleTargetRef:
    apiVersion: instancemgr.keikoproj.io/v1alpha1
    kind: InstanceGroup
    name: my-instance-group

We should also probably explore supporting something like HPA's behavior spec, applied to node capacity instead of pod replicas:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100  # should be between 0 and 40
      periodSeconds: 15
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
    - type: Pods
      value: 4
      periodSeconds: 15
    selectPolicy: Max
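A rough sketch of how such policies could bound a single scaling step. The semantics here are assumptions, not something settled in this issue: Percent is read as a percentage of the current capacity, the HPA-style Pods type is read as an absolute number of capacity units (for example vCPUs), and stabilization windows and periodSeconds are ignored.

```python
import math
from typing import Dict, List


def allowed_scale_up(current: int, policies: List[Dict], select_policy: str = "Max") -> int:
    """Upper bound on capacity after one scaling step, given HPA-style policies."""
    if select_policy == "Disabled":
        return current  # scaling in this direction is turned off

    ceilings = []
    for policy in policies:
        if policy["type"] == "Percent":
            # Grow by at most value% of the current capacity, rounded up.
            ceilings.append(current + math.ceil(current * policy["value"] / 100))
        elif policy["type"] == "Pods":
            # Absolute step; the example above reuses HPA's "Pods" type name.
            ceilings.append(current + policy["value"])
        else:
            raise ValueError(f"unknown policy type: {policy['type']}")

    # Max picks the most permissive policy, Min the most restrictive one.
    return max(ceilings) if select_policy == "Max" else min(ceilings)


scale_up_policies = [
    {"type": "Percent", "value": 100, "periodSeconds": 15},
    {"type": "Pods", "value": 4, "periodSeconds": 15},
]

# Starting from 2 units: Percent allows 4, the absolute policy allows 6.
print(allowed_scale_up(2, scale_up_policies, "Max"))
print(allowed_scale_up(2, scale_up_policies, "Min"))
```

With the instance type range from the spec, the result would then be rounded down to the largest instance type that does not exceed this bound.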

@backjo any thoughts on this, would you find this useful?

Answer 1 · 2022-04-01T20:45:32.000Z

I could see it being useful - though we just use multiple IGs right now with scale from zero enabled and it solves it for us. CA does a decent job of scaling between them. It is a bit tedious though.

Answer 2 · 2022-04-01T20:51:07.000Z

@backjo interesting, so you keep multiple IGs at min 0, and if you need to scale beyond the max of ASG-1, then ASG-2..N would scale up additional nodes for you? How does CA know which ASG to scale?
In this case, would it make more sense to scale vertically with a single IG instead and keep the same range of nodes, e.g. min 3 / max 10?

Answer 3 · 2022-04-12T13:10:29.000Z

More like: we have multiple IGs with different compute / memory requirements, and CA is configured with the least-waste expander.
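To illustrate why least-waste works with differently shaped IGs, here is a simplified sketch of the idea as the Cluster Autoscaler FAQ describes it: pick the node group that leaves the least idle CPU after scale-up, breaking ties on unused memory. This is not CA's actual code. It ignores per-pod bin packing, and the group names are hypothetical.

```python
import math
from typing import List, NamedTuple


class NodeGroup(NamedTuple):
    name: str
    vcpu: int      # vCPUs per node
    mem_gib: int   # memory per node, in GiB


def least_waste(groups: List[NodeGroup], pending_cpu: int, pending_mem: int) -> NodeGroup:
    """Pick the group with the least idle CPU after scale-up, then least idle memory."""
    def waste(group: NodeGroup):
        # Nodes needed to fit the aggregate pending demand (at least one).
        nodes = max(1,
                    math.ceil(pending_cpu / group.vcpu),
                    math.ceil(pending_mem / group.mem_gib))
        idle_cpu = nodes * group.vcpu - pending_cpu
        idle_mem = nodes * group.mem_gib - pending_mem
        return (idle_cpu, idle_mem)

    return min(groups, key=waste)


groups = [
    NodeGroup("general-m5-xlarge", 4, 16),
    NodeGroup("compute-c5-xlarge", 4, 8),
    NodeGroup("memory-r5-xlarge", 4, 32),
]

# Memory-heavy pending pods: 4 vCPU / 28 GiB in total.
print(least_waste(groups, pending_cpu=4, pending_mem=28).name)
# CPU-heavy pending pods: 8 vCPU / 10 GiB in total.
print(least_waste(groups, pending_cpu=8, pending_mem=10).name)
```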