Support over-provisioning with computeClasses
cjellick opened this issue · 4 comments
This feature will allow us to configure a compute class so that we can "overprovision" pods onto a node based on memory, meaning we can schedule more pods to it than we currently can.
Background
ComputeClasses are acorn's abstraction around scheduling pods. They primarily handle a few things:
- schedule rules such as tolerations, affinity, runtimeClass, and priorityClass
- resource requests and limits. Right now this is limited to memory and CPU.
The intention is that administrators configure these nitty gritty details behind the scenes and end users just need to say they want the "memory-optimized" or "gpu" compute class.
For the context of this feature, we are only concerned with the resource request/limit aspect of computeClasses, not the scheduling aspect.
When you create an acorn, you do two things related to resource consumption:
- You say how much memory each container should get (or you get the default)
- You say which computeClass it should use (or you get the default)
Our controllers translate that into resource requests and limits for memory and cpu. The user never directly specifies CPU. The computeClass has a cpuScaler field which basically says "for X memory give Y millicpu." We do this because we think users understand the implications of memory more easily than cpu scheduling, and because in clouds like AWS the ratio is predictable within an instance family: with M5s, for example, you always get 1 cpu for every 4 GB of memory. By controlling the ratio, we schedule more efficiently.
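To make the ratio concrete, here is a minimal sketch of how a cpuScaler value could be derived from a node shape. The helper name is hypothetical, and reading the scaler as "CPUs per GB of memory" is an assumption based on the M5 example, not something taken from the acorn code.

```python
def cpu_scaler_for_instance(vcpus: int, memory_gb: float) -> float:
    """Hypothetical helper: CPUs available per GB of memory on a node type.

    Assumption: cpuScaler is expressed as CPUs per GB of memory.
    """
    if memory_gb <= 0:
        raise ValueError("memory_gb must be positive")
    return vcpus / memory_gb


# An M5-style shape with 1 CPU for every 4 GB of memory (here 2 CPUs, 8 GB).
m5_scaler = cpu_scaler_for_instance(vcpus=2, memory_gb=8)
```

Under that reading, a compute class backed by M5-style nodes would use a cpuScaler of 0.25.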
Actual feature
Right now if you say your acorn gets 1GB of memory, that's exactly what gets set for the limit and request for memory. For cpu, we take 1G and math it (I forget how exactly) against the cpuScaler to calculate your cpu request. We do not set a cpu limit.
To handle over provisioning, we want to add a new memoryScaler field (name subject to change) that will change what is set for memory and cpu requests. It should be a decimal representing the fraction of the original value. For example, if my computeClass has memoryScaler: .25 and I set memory on my acorn to 1G, then the actual memory request on the pod should be .25G (in whatever the right unit should be). Additionally, we should use the scaled value to calculate the cpu request. So, with the previous example, if my cpuScaler was .25, then the cpu request for my pod would be 1000millicpu * .25 * .25, or 62.5 millicpu. The memory limit should remain exactly what the user requested (1G in this case).
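The arithmetic in this example can be sketched as follows. This is only an illustration: the function and field names are hypothetical, and the rule that 1G of memory corresponds to 1000 millicpu before cpuScaler is applied is inferred from the numbers in the example, not from the controller code.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    memory_limit_gb: float
    memory_request_gb: float
    cpu_request_millicpu: float


def scale_resources(memory_gb: float, cpu_scaler: float,
                    memory_scaler: float = 1.0) -> Resources:
    """Hypothetical sketch of the proposed math, not the actual implementation."""
    # The memory request is scaled down; the limit stays at what the user asked for.
    memory_request = memory_gb * memory_scaler
    # Assumption: 1 GB of memory maps to 1000 millicpu before cpuScaler applies.
    cpu_request = memory_request * 1000 * cpu_scaler
    return Resources(
        memory_limit_gb=memory_gb,
        memory_request_gb=memory_request,
        cpu_request_millicpu=cpu_request,
    )


# The example from the issue: 1G of memory, memoryScaler .25, cpuScaler .25.
example = scale_resources(memory_gb=1.0, cpu_scaler=0.25, memory_scaler=0.25)
```

With these inputs the sketch gives a memory limit of 1G, a memory request of .25G, and a cpu request of 62.5 millicpu.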
When this lands, we should be able to handle "old" computeClasses that didn't have the field by assuming they have a memoryScaler of 1.
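One way to express that defaulting rule is sketched below. The function name is hypothetical, and treating a zero value the same as a missing one is an assumption (an unset numeric field often deserializes to zero), not a confirmed behavior.

```python
from typing import Optional


def effective_memory_scaler(configured: Optional[float]) -> float:
    """Hypothetical defaulting rule: no scaler configured means no scaling."""
    # Assumption: a zero value is treated like an unset field.
    if configured is None or configured == 0:
        return 1.0
    return configured
```

An old computeClass would then behave exactly as it does today, since multiplying by 1 leaves the memory request equal to the requested memory.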
We'll need to make sure this field gets synced up properly in manager, and ultimately we will need to change the computeClasses created as part of manager to have a memoryScaler value of something less than 1 to actually take advantage of the feature. You'll have to look at our workloads in the saas and see what a reasonable scaler value would be. We also may need to consider having different computeClasses for, say, sandbox vs pro vs byoc vs control plane. But we'll want to start with something reasonably simple here.
@cloudnautique recently added a field to computeClasses, so looking at that PR would be a good place to start. @tylerslaton did the initial implementation, so he'll be a resource you should leverage.
This shouldn't be too hard, just adding the field and then wiring up the math correctly. Happy to help answer any questions that come up.
Tested with acorn version - v0.10.0-rc2-9-g43dbcbf4+43dbcbf4
- Able to create computeclass with requestScaler set:
kind: ClusterComputeClass
apiVersion: admin.acorn.io/v1
default: false
metadata:
  name: cc2
description: Large compute for linux on arm64
cpuScaler: 0.75
supportedRegions:
- local
memory:
  default: 20M
  values:
  - 10M
  - 20M
  - 30M
  requestScaler: 0.25
- Deploy app with this computeclass. Pods get created with the following container spec, as expected:
"resources": {
    "limits": {
        "memory": "20M"
    },
    "requests": {
        "cpu": "4m",
        "memory": "5M"
    }
},
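As a sanity check, the observed values can be reproduced with a small sketch. The function name is hypothetical, and both the decimal units for the cpu step and the rounding up to a whole millicpu are assumptions, chosen only because they match the output above. The actual controller may compute this differently.

```python
import math

MEGABYTE = 1000 ** 2  # the Kubernetes "M" suffix is decimal
GIGABYTE = 1000 ** 3


def expected_requests(memory_bytes: int, cpu_scaler: float, request_scaler: float):
    """Hypothetical reconstruction of the request math for the test above."""
    # The memory request is the requested memory scaled by requestScaler.
    memory_request = memory_bytes * request_scaler
    # Assumption: 1000 millicpu per GB of scaled memory, rounded up.
    millicpu = math.ceil(memory_request / GIGABYTE * 1000 * cpu_scaler)
    return memory_request, millicpu


# Values from the test: 20M memory, cpuScaler 0.75, requestScaler 0.25.
memory_request, millicpu = expected_requests(
    20 * MEGABYTE, cpu_scaler=0.75, request_scaler=0.25
)
```

Under these assumptions the sketch yields a 5M memory request and a 4m cpu request (3.75 rounded up), while the limit stays at the requested 20M.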
Thanks for doing the initial certification on the value itself. If you could just do a last verification that this value is set and takes effect in free tier regions, we should be good to close this.
When deploying apps in free tier regions (with default compute class - default), there is a requestScaler of 0.1 which is applied to the deployed apps' requests as expected.