This repository contains the solution to the auto-scaling incident exercise.
Our production AWS Kubernetes cluster runs on several instance groups, all of them spot-instance based.
During a scaling event, we were unable to acquire new spot instances because their availability dropped to 0 in the regions we reside in (although our max price was set to the on-demand value).
This event caused a major performance degradation because our services could not handle the increasing request throughput.
Design a solution that can deal with the described situation while maintaining high availability and being as cost effective as possible.
The solution relies on a few assumptions:
- The current spot instance groups have working nodes that did not terminate (i.e. their node count stays the same; we just cannot request more).
- The kubernetes cluster was created using kops tool.
- State bucket is already created.
- IAM roles are already created.
- A cluster resource should be created:
kops create cluster \
  --name evya.test.cluster \
  --state s3://evya-cluster-state \
  --cloud aws \
  --master-size t3.large \
  --master-count 1 \
  --master-zones us-east-1a \
  --node-size t3.large \
  --node-count 1 \
  --zones us-east-1a,us-east-1b,us-east-1c \
  --ssh-public-key ~/.ssh/evya_cluster_rsa.pub
- Verify what will be created with a dry run:
kops update cluster evya.test.cluster
- Once the output shows that what we want to create is correct, apply it:
kops update cluster evya.test.cluster --yes
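Optionally, the cluster can then be sanity-checked with kops validate, which reports whether the instance groups and nodes are up and Ready:

kops validate cluster --name evya.test.cluster --state s3://evya-cluster-state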
- Two spot instance groups should be configured:
- Spot instance group spot-1
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: evya.test.cluster
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/evya.test.cluster: owned
  name: spot-1
spec:
  image: kope.io/k8s-1.16-debian-stretch-amd64-hvm-ebs-2020-01-17
  machineType: t3.large
  maxPrice: $(MAX_PRICE)
  maxSize: 3
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: spot-1
    node-type/spot: "true"
  role: Node
  subnets:
  - us-east-1a
  - us-east-1b
  - us-east-1c
- Spot instance group spot-2
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: evya.test.cluster
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/evya.test.cluster: owned
  name: spot-2
spec:
  image: kope.io/k8s-1.16-debian-stretch-amd64-hvm-ebs-2020-01-17
  machineType: m5.large
  maxPrice: $(MAX_PRICE)
  maxSize: 3
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: spot-2
    node-type/spot: "true"
  role: Node
  subnets:
  - us-east-1a
  - us-east-1b
  - us-east-1c
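The two manifests above can be registered with kops and rolled out as follows (a sketch, assuming they are saved as spot-1.yaml and spot-2.yaml):

kops create -f spot-1.yaml --state s3://evya-cluster-state
kops create -f spot-2.yaml --state s3://evya-cluster-state
kops update cluster evya.test.cluster --state s3://evya-cluster-state --yes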
The proposed solution is to create a third instance group, but rather than defining it as 'spot', we define it as on-demand and use it only as a fallback. Meaning, when an instance group needs to be scaled out, the cluster will first try to scale a spot instance group and, only if it cannot, will scale the on-demand instance group.
The proposed solution makes use of the Kubernetes Cluster Autoscaler.
Put simply, the application takes a few configuration arguments telling it which AWS ASGs (instance groups) it should look for. It also listens to Kubernetes events regarding deployments/pods and increases/decreases the cluster size accordingly.
The third on-demand instance group 'on-demand':
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: evya.test.cluster
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/evya.test.cluster: owned
  name: nodes
spec:
  image: kope.io/k8s-1.16-debian-stretch-amd64-hvm-ebs-2020-01-17
  machineType: t3.large
  maxSize: 3
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: on-demand
    node-type/on-demand: "true"
  role: Node
  subnets:
  - us-east-1a
  - us-east-1b
  - us-east-1c
  taints:
  - node-type/on-demand=true:PreferNoSchedule
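Note: depending on the kops version, the cluster-autoscaler discovery tags may need to be placed under spec.cloudLabels (rather than metadata labels) so that they are propagated as tags on the underlying ASG; a minimal sketch of that variant:

spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/evya.test.cluster: "owned"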
The AutoScaler app will be configured as follows:
- Cloud provider: AWS
- Skip nodes with local storage: False
- Expander type: Priority
- Node groups: auto-discovered via the following ASG tags:
  - k8s.io/cluster-autoscaler/enabled
  - k8s.io/cluster-autoscaler/evya.test.cluster
- Balance similar node groups: True
Before deploying the autoscaler, a priority ConfigMap ('priority-expander-cm') needs to be created to let the autoscaler know the priority of each instance group:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*nodes.*
    20:
      - .*spot.*
The priority-expander ConfigMap tells the autoscaler the priority of each instance group matched by the regex. The highest value wins: that group is scaled first, and the others are scaled only if it cannot be.
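Assuming the manifest above is saved as priority-expander-cm.yaml (filename assumed), it can be created and verified with:

kubectl apply -f priority-expander-cm.yaml
kubectl -n kube-system get configmap cluster-autoscaler-priority-expander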
The autoscaler's arguments to be passed (the entire YAML file can be seen in autoscaler.yaml):
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=priority
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/evya.test.cluster
- --balance-similar-node-groups
After configuring the autoscaler, whenever a deployment's replica count changes and a pod is left in the "Pending" state due to a lack of resources, the autoscaler will first try to scale out one of the spot instance groups and, if it cannot, will scale out the on-demand instance group.
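A minimal sketch of deploying and then observing the autoscaler, assuming the manifest is saved as autoscaler.yaml as referenced above (the status ConfigMap name is the Cluster Autoscaler default):

kubectl apply -f autoscaler.yaml
# The autoscaler records its view of each node group in a status ConfigMap:
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml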
Even though we deployed an autoscaler to automatically scale the instance groups' node counts according to our workload, running on spot instances carries the risk of an instance being terminated by AWS. AWS provides a two-minute notice before reclaiming a spot instance, and we need to handle it.
To do so, the solution uses the k8s-spot-termination-handler Helm 2 chart.
There is no need to set any custom values at this point.
Install command:
helm install stable/k8s-spot-termination-handler --namespace kube-system --name spot-term-handler
The k8s-spot-termination-handler chart deploys a DaemonSet (one pod per node) that queries the EC2 spot instance termination endpoint to check whether the node it is running on is about to be shut down. If so, the application gracefully shuts down all of the pods on that node so they can be rescheduled onto a different node.
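For context, the same termination notice can be checked manually from a node via the EC2 instance metadata service; a minimal sketch (the endpoint returns 404 while no interruption is scheduled):

curl -s http://169.254.169.254/latest/meta-data/spot/instance-action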
Even though the solution defines a way to auto-scale onto on-demand instances when spot instances aren't available, it won't check whether any spot instances are back in the pool unless the number of replicas changes.
To make sure we run on 'spot' instances whenever possible and fall back to 'on-demand' instances only when we must, the solution uses an application called k8s-spot-rescheduler.
Taken from its GitHub:
K8s Spot rescheduler is a tool that tries to reduce load
on a set of Kubernetes nodes.
It was designed with the purpose of moving Pods scheduled
on AWS on-demand instances to AWS spot instances
to allow the on-demand instances to be safely scaled down
Because the solution created a new instance group just for 'on-demand' instance types, added a PreferNoSchedule taint,
and added node labels that allow us to differentiate between 'spot' and 'on-demand' instances, we can configure k8s-spot-rescheduler
with the following config (the entire deployment file can be seen in rescheduler.yaml):
command:
- k8s-spot-rescheduler
- -v=2
- --running-in-cluster=true
- --namespace=kube-system
- --housekeeping-interval=10s
- --node-drain-delay=10m
- --pod-eviction-timeout=2m
- --max-graceful-termination=2m
- --listen-address=0.0.0.0:9235
- --on-demand-node-label=node-type/on-demand
- --spot-node-label=node-type/spot
Once deployed, k8s-spot-rescheduler
will check whether the pods running on the 'on-demand' instance group can be migrated to the 'spot' instance group. If so, it will start draining the 'on-demand' node and the default Kubernetes scheduler will take over; if not, it will do nothing.
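A minimal sketch of deploying it and checking the node labels it relies on, assuming the deployment manifest referenced above is saved as rescheduler.yaml:

kubectl apply -f rescheduler.yaml
# Show which nodes carry the spot / on-demand labels used by the rescheduler:
kubectl get nodes -L node-type/spot,node-type/on-demand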