
AAW Dev: Determine what machine types dev nodes are on and re-size if necessary.


Determine what kind of machines we are running on, as we can move to Standard_D2ds_v5 to align with FinOps' suggestion.
This task involves the following:

  • Document what machine types we are on for each nodepool (cloudmainsys, general, system, useruc) (see the sketch after this list)
  • Determine whether we would benefit (financially) from moving to the Standard_D2ds_v5 ($106/month)
    • This is very likely a yes; we should also see how it all fits together with new requests and machine sizes
  • If we do benefit, make the PR to change our infrastructure to deploy Standard_D2ds_v5 machines instead of the ones found in step 1
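For step 1, a quick way to document the machine types is to read them straight off the node labels. Below is a minimal sketch using the Kubernetes Python client; it assumes the active kubectl context points at the dev cluster, and that the AKS `agentpool` and standard `node.kubernetes.io/instance-type` / `beta.kubernetes.io/instance-type` labels are present (which matches the K9s output further down).

```python
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()  # uses the active kubectl context (assumed: dev cluster)
v1 = client.CoreV1Api()

pools = defaultdict(set)
for node in v1.list_node().items:
    labels = node.metadata.labels or {}
    pool = labels.get("agentpool", "unknown")  # AKS nodepool label
    size = labels.get("node.kubernetes.io/instance-type",
                      labels.get("beta.kubernetes.io/instance-type", "unknown"))
    pools[pool].add(size)

for pool, sizes in sorted(pools.items()):
    print(f"{pool}: {', '.join(sorted(sizes))}")
```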

One catch here: don't forget to downsize the aks-cloudmainsys nodepool, whose config can be found here

Machine types on the dev cluster by nodepool:
nodepool: system
output from K9s:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: Standard_D8s_v3
kubernetes.azure.com/node-image-version: AKSUbuntu-2204gen2containerd-202411.12.0
kubernetes.azure.com/storagetier: Premium_LRS
topology.kubernetes.io/zone:
one VM on canadacentral-2
one VM on canadacentral-3
output from Azure dashboard:
target nodes: 2
scale method: autoscale
All other info conforms to K9s info.

nodepool: general
output from K9s:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: Standard_D8s_v3
kubernetes.azure.com/node-image-version: AKSUbuntu-2204gen2containerd-202411.12.0
kubernetes.azure.com/storagetier: Premium_LRS
topology.kubernetes.io/zone:
two VMs on canadacentral-2
two VMs on canadacentral-3
one VM on canadacentral-1
output from Azure dashboard:
target nodes: 5
scale method: autoscale
All other info conforms to K9s info.

nodepool: cloudmainsys
output from K9s:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: Standard_D16s_v3
kubernetes.azure.com/node-image-version: AKSUbuntu-2204gen2containerd-202411.12.0
kubernetes.azure.com/storagetier: Premium_LRS
topology.kubernetes.io/zone: "0"
output from Azure dashboard:
target nodes: 1
scale method: autoscale
All other info conforms to K9s info.

Unused nodepools (all of these have target nodes set to 0), with node sizes:

  • usercpu72pb: Standard_F72s_v2
  • usercpu72uc: undefined
  • usergpu4pb: Standard_NC24s_v3
  • usergpu4uc: Standard_NC24s_v3
  • usergpupb: Standard_NC6s_v3
  • usergpuuc: Standard_NC6s_v3
  • userpb: Standard_D16s_v3
  • useruc: Standard_D16s_v3

Now looking into the machine type specs for the listed machines, documentation on how to choose an appropriate machine type, and the logic AKS nodepools use to scale up and down.

Resources:
Microsoft learn links:

Jacek has some questions about the context of this issue:
Please add details relating to this ticket here.

Fortunately, for cloudmainsys, the determination is really simple. There is only one VM running on the nodepool, and it's clearly underutilized with respect to cpu, memory, and disk space. See below:
[Screenshot: cloudmainsys node utilization]
So in this case we don't have to worry about any scaling behaviour; we know that all the workloads will fit comfortably on one machine of type Standard_D2ds_v5 (2 vCPUs, 8 GiB memory).

For the system nodepool, there are two machines of type Standard_D8s_v3, but both are clearly underutilized, with cpu at less than 10% and memory at less than 20%. See below:
[Screenshot: system node utilization]
I noticed that in the terraform file that configures this nodepool the minimum number of nodes is set to 2, presumably for georedundancy reasons. See link: https://gitlab.k8s.cloud.statcan.ca/cloudnative/aaw/terraform-advanced-analytics-workspaces-infrastructure/-/blob/main/dev_cc_00.tf?ref_type=heads#L70. And we can see that one machine is in availability zone 2 and the other in zone 3. Their workloads can also fit on a pair of machines of type Standard_D2ds_v5 as above.
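To back up the "fits comfortably" claim for both nodepools, here is a rough sketch that sums pod CPU and memory requests per nodepool so they can be compared against a candidate size. Assumptions: the active kubectl context is the dev cluster, the AKS `agentpool` label is present, the quantity parser only handles the common suffixes, and the Standard_D2ds_v5 capacity noted in the comment should be double-checked against Azure's spec pages.

```python
from collections import defaultdict
from kubernetes import client, config

# Simplified quantity parser: handles the common 'm', 'Ki', 'Mi', 'Gi' suffixes only.
UNITS = {"m": 1e-3, "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def qty(s):
    for suffix, mult in UNITS.items():
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * mult
    return float(s)

config.load_kube_config()  # assumed: dev cluster context
v1 = client.CoreV1Api()

# Map node name -> nodepool via the AKS 'agentpool' label.
node_pool = {n.metadata.name: (n.metadata.labels or {}).get("agentpool", "unknown")
             for n in v1.list_node().items}

cpu = defaultdict(float)  # requested cores per nodepool
mem = defaultdict(float)  # requested bytes per nodepool
for pod in v1.list_pod_for_all_namespaces().items:
    pool = node_pool.get(pod.spec.node_name)
    if pool is None:
        continue  # skip pods that aren't scheduled yet
    for c in pod.spec.containers:
        req = (c.resources.requests or {}) if c.resources else {}
        cpu[pool] += qty(req.get("cpu", "0"))
        mem[pool] += qty(req.get("memory", "0"))

# Compare these totals against the candidate capacity, e.g. one Standard_D2ds_v5
# offers roughly 2 vCPU / 8 GiB per node (double-check against Azure's specs).
for pool in sorted(cpu):
    print(f"{pool}: {cpu[pool]:.2f} cores requested, {mem[pool] / 2**30:.1f} GiB requested")
```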

For the general nodepool, see issue aaw-1965 for the reasoning behind selecting the memory optimized machine model Standard_E4ds_v5. The terraform change for that nodepool is also linked in that issue.

Some additional comments on VM selection, based on my research, follow below with links:

Microsoft Mechanics video (link: https://www.youtube.com/watch?v=zOSvnJFd3ZM)
Main takeaways:
If your workloads utilize one resource type (CPU, memory, storage) more heavily than the others, choose the appropriate resource-optimized VM family:
D family for standard workloads
M family for memory heavy workloads
F family for cpu intensive workloads
L family for throughput heavy workloads with large persistent storage needs

Some notes on azure's virtual machine model nomenclature:
(links: https://learn.microsoft.com/en-us/azure/virtual-machines/vm-naming-conventions, https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/overview)

Family codes:
A-family
For entry-level workloads such as low-traffic web servers.
B-family (B for burstable)
Uses a different costing model: during low-utilization periods compute credits are accumulated, and these are consumed when the machine runs above a baseline threshold. Once all credits are consumed, the machine is throttled back below that threshold until more credits accumulate. (A toy simulation of this credit mechanism follows the family list below.)
D-family
For general purpose enterprise grade applications.
DC-family
Like D family but with hardware-based Trusted Computing Environments (TEEs) for confidential computing.
F- and FX-family
Compute optimized machines have high CPU-to-memory ratios.
E-, Eb-, and M-families
Memory optimized machines with high memory-to-CPU ratios.
The Eb seems like a hybrid between E and L in that it also has high throughput remote storage capabilities.
EC-family
Like E family but with hardware-based TEEs for confidential computing.
L-family
Storage optimized machines, with high disk throughput and large local disks.
NC-, ND-, NG-, and NV-families
GPU-accelerated, offering single, multiple, or fractional GPUs.
For gaming, visualization, or GPU-accelerated machine-learning workloads.
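A toy simulation of the B-family credit mechanism mentioned above. The 20% baseline and the accrual rule here are illustrative only, not Azure's exact accounting.

```python
def burstable_credits(cpu_demand, baseline=0.20):
    """Toy model of the B-family credit mechanism (illustrative numbers only).

    cpu_demand: per-interval CPU utilization the workload wants (0.0 to 1.0).
    baseline:   usage below it banks credits, usage above it spends them, and
                with no credits left the VM is capped at the baseline.
    Returns the utilization actually delivered each interval.
    """
    credits = 0.0
    delivered = []
    for want in cpu_demand:
        if want <= baseline:
            credits += baseline - want             # bank unused baseline capacity
            delivered.append(want)
        else:
            burst = min(want - baseline, credits)  # spend banked credits
            credits -= burst
            delivered.append(baseline + burst)
    return delivered

# Idle for a while, then try to run flat out: the burst lasts only while credits last.
print(burstable_credits([0.05] * 10 + [1.0] * 10))
```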

Feature codes:
CPU related:
no CPU code := Intel-based CPU
a := AMD-based CPU
p := ARM-based CPU
Memory related:
t := tiny amount of memory per core in relation to other VMs in this family
l := less memory per core than other machine series in this family
m := more memory per core than other machine series in this family
Storage related:
d := local data disks can be attached
s := capable of accessing Azure premium storage accounts
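To tie the nomenclature together, here is a small, hypothetical decoder for the size names that appear in this issue. It only covers the basic family + vCPU count + feature letters + version pattern; the full convention has more fields (accelerator counts, and family modifiers like Eb get folded into the feature letters here), so treat it as a sketch rather than a complete parser.

```python
import re

# Hypothetical helper: decode an Azure VM size name using the naming convention above.
SIZE_RE = re.compile(r"^Standard_(?P<family>[A-Z]+)(?P<vcpus>\d+)"
                     r"(?P<features>[a-z]*)(?:_(?P<version>v\d+))?$")

def decode(size: str) -> dict:
    m = SIZE_RE.match(size)
    if not m:
        raise ValueError(f"unrecognized size name: {size}")
    return m.groupdict()

for s in ["Standard_D8s_v3", "Standard_D2ds_v5", "Standard_E4ds_v5", "Standard_NC24s_v3"]:
    print(s, "->", decode(s))
```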

Notes on cloudtrooper blog post on VM size selection:
(link: https://blog.cloudtrooper.net/2020/10/23/which-vm-size-should-i-choose-as-aks-node)
Arguments in favour of larger machines:
Each VM in Azure has a limit on I/O, which is higher for larger nodes.
The number of data disks each VM size can attach also increases with size.
They mention that control plane overhead increases with smaller nodes, but don't provide any specific numbers or formulas.
They mention the Azure CNI plugin: for large machine sizes you want to increase the pod limit, which defaults to 30, otherwise you're just throttling your VMs unnecessarily. This parameter is only configurable during cluster creation. (See the check below for the current per-node limit and usage.)
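As a quick way to see how close we are to that per-node pod limit, here is a sketch that reads the allocatable pod count and the number of scheduled pods per node (Kubernetes Python client again; assumes the dev cluster context).

```python
from collections import Counter
from kubernetes import client, config

config.load_kube_config()  # assumed: dev cluster context
v1 = client.CoreV1Api()

# Count scheduled pods per node across all namespaces.
pods_per_node = Counter(p.spec.node_name
                        for p in v1.list_pod_for_all_namespaces().items
                        if p.spec.node_name)

for node in v1.list_node().items:
    name = node.metadata.name
    max_pods = node.status.allocatable.get("pods", "?")  # the per-node pod limit
    print(f"{name}: {pods_per_node.get(name, 0)}/{max_pods} pods")
```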

Arguments in favour of smaller nodes:
Granularity when scaling: the smaller the node, the smoother the scalability curve.
Think of your total resource requests as a function of time.
Your provisioned capacity will be a step graph that tries to match that function,
and the virtual machine size determines the smallest step you can take (see the worked example after this list).
Node failure blast radius.
If you have a two-node cluster and you lose one node, 50% of your workloads are evicted and looking for a home while the node is re-provisioned.
If you have a ten node cluster that number drops to 10%.
Certain resources are provisioned per node, like SNAT ports (although that number is configurable).
Finally, you want to have a certain minimum number of nodes per Availability Zone in some situations. For instance, if you are using disk PVs you want to restrict pods to certain AZs so they don't try to mount disks in a different AZ. (Not clear on this one.)
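A worked version of the granularity and blast-radius arguments, using a made-up steady request of 9 cores purely for illustration:

```python
import math

def nodes_needed(total_request_cores, cores_per_node):
    """Capacity comes in whole-node steps, so round up."""
    return math.ceil(total_request_cores / cores_per_node)

def provisioned_cores(total_request_cores, cores_per_node):
    return nodes_needed(total_request_cores, cores_per_node) * cores_per_node

def blast_radius(node_count):
    """Fraction of capacity lost when a single node fails."""
    return 1 / node_count

# Hypothetical steady request of 9 cores: smaller nodes track it more closely
# and spread the failure risk, at the cost of running more machines.
for size_name, cores in [("Standard_D8s_v3", 8), ("Standard_D2ds_v5", 2)]:
    n = nodes_needed(9, cores)
    print(f"{size_name}: {n} nodes, {provisioned_cores(9, cores)} cores provisioned, "
          f"{blast_radius(n):.0%} of capacity lost per node failure")
```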

Their concluding remarks:
Start with 4-core VMs such as Standard_D4s_v4.
Consider larger VMs if:
Your workloads require more I/O performance per node.
You need more than 8 data disks per node.
Your cluster would grow too large (for example 15-20 nodes).
(Why is that too large? They don't explain; the Kubernetes documentation states that the largest currently supported cluster size is 5,000 nodes.)

Summarize and bring up at Thursday's Elab.

One last piece of info I got from the Azure pricing calculator: I was wondering how VM pricing scales across machine sizes within a given machine family. Here are the pricing tables for the D and E families:

[Images: pricing tables for the D and E families]

I wanted to see whether there is any incentive to consolidate workloads on larger machines, but there isn't. Pricing scales linearly with CPU count and memory size, so it's just as expensive to rent n VMs with x CPU cores and y GiB of memory each as it is to rent 1 VM with n * x CPU cores and n * y GiB of memory.
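A tiny sanity check of that linearity, using the $106/month Standard_D2ds_v5 figure from this issue; the per-vCPU rate and the D8ds_v5 price below are derived from it, not quoted from the calculator.

```python
# Linearity check: if the price per vCPU within a family is constant, then n small
# VMs cost the same as one VM that is n times larger.
d2ds_v5_monthly = 106.0                 # 2 vCPU, 8 GiB (figure from this issue)
price_per_vcpu = d2ds_v5_monthly / 2

d8ds_v5_monthly = 8 * price_per_vcpu    # assumed linear scaling, not a quoted price

four_small = 4 * d2ds_v5_monthly
one_large = d8ds_v5_monthly
print(f"4 x D2ds_v5: ${four_small:.0f}/month, 1 x D8ds_v5: ${one_large:.0f}/month")
# Both come out to $424/month, i.e. no bulk discount for consolidating onto bigger
# machines; the saving has to come from dropping unused capacity instead.
```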

The conclusion, supported by the discussion we had during Thursday's Elab meeting, is that it makes sense to scale down the VMs used in the cloudmainsys and system nodepools to the Standard_D2ds_v5 model, as it should be able to handle the existing workloads without issue.

For the general nodepool, we agreed that it's worth trying the Standard_E4ds_v5 memory-optimized model, as it should provide better CPU utilization while having enough memory that the general nodepool will not need to scale up beyond the existing count of 5 VMs.

I also suggested that we can run the existing workloads on the updated VM types and simply compare the costs incurred over a given time period; that will give the most concrete evidence of whether or not we're running more appropriate machines.
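For that comparison, something as simple as summing the VM line items from two Azure cost exports (one period before the change, one equally long period after) would do. A rough sketch; the file names and the MeterCategory / CostInBillingCurrency column names are assumptions and need to be adjusted to whatever the actual export schema uses.

```python
import csv

def total_vm_cost(path):
    """Sum the VM cost rows from a cost export CSV (column names are assumptions)."""
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("MeterCategory") == "Virtual Machines":
                total += float(row["CostInBillingCurrency"])
    return total

# The two export periods should be the same length for a fair comparison.
before = total_vm_cost("costs_before_resize.csv")   # hypothetical file name
after = total_vm_cost("costs_after_resize.csv")     # hypothetical file name
print(f"VM cost before: ${before:.2f}, after: ${after:.2f}, "
      f"change: {100 * (after - before) / before:+.1f}%")
```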