Update AAW prod nodepools vm sizing
Let's downsize useruc, as we don't have as many users as we used to and most of the pool's resources are sitting idle.
The usercpu72 nodes use an old F-series SKU that no longer provides any advantage; let's put the D64as_v5 in that spot for larger workloads. An F64xx_v6 may be on the way that would be a better fit, but it isn't available in Canada Central right now.
- useruc - Standard_D64as_v5 -> Standard_D16as_v5
- userpb - Standard_D16s_v3 -> Standard_D16as_v5
- usercpu72uc - Standard_F72s_v2 -> Standard_D64as_v5, rename to usercpuuc?
- usercpu72pb - Standard_F72s_v2 -> Standard_D64as_v5, rename to usercpupb?
Lower priority
- storage - remove/unused?
- monitoring - remove/unused?
These changes can be made in the aaw-prod Terraform here: https://gitlab.k8s.cloud.statcan.ca/cloudnative/aaw/terraform-advanced-analytics-workspaces-infrastructure/-/blob/main/prod_cc_00.tf?ref_type=heads
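For illustration, here is a minimal sketch of what one of these resizes might look like with the azurerm provider. The resource and cluster names are assumptions; the real pool definitions live in the modules linked above.

```hcl
# Hypothetical sketch only: the resource and cluster names here are
# assumptions, the real pool definitions live in the AAW Terraform modules.
resource "azurerm_kubernetes_cluster_node_pool" "useruc" {
  name                  = "useruc"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aaw_prod.id

  # Downsize: most of the D64 capacity was sitting idle.
  vm_size = "Standard_D16as_v5" # was Standard_D64as_v5
}
```

Note that changing vm_size on an existing pool forces Terraform to replace it, so the nodes get drained and recreated; hence the after-hours window below.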
To be deployed after work hours on Wed Sept 25 or Oct 2, since it's a deployment to prod that will affect nodes.
Changes are being made in:
https://gitlab.k8s.cloud.statcan.ca/cloudnative/aaw/modules/terraform-azure-statcan-aaw-environment/-/merge_requests/51
Pending review and the right timeframe for the merge.
- check the logic for creating new notebooks, which possibly uses the nodepool name (this might just be done with taints/tolerations)
- check with the Census coding team, who are also using the F72 nodes, that they're good with the changes
- check if there's anything particular running on the monitoring/storage nodepools
a. check argocd manifests (or other repos) for tolerated deployments
b. inspect nodes on cluster for particular pods with tolerations
c. match on use (probably the best bet)
No examples found using the monitoring nodepool
Storage nodepool found at https://github.com/StatCan/terraform-kubernetes-aks-daaas-private/blob/master/aks.tf#L133
This repo is archived
No use of nodepool names was found on GitHub. Safe to assume the relationships are defined only by their taints.
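For reference, a rough sketch of what that taint-based pairing typically looks like on the node pool side; the taint key and value here are made up, not taken from the AAW modules.

```hcl
# Hypothetical sketch: the actual taint keys/values in the AAW modules
# may differ.
resource "azurerm_kubernetes_cluster_node_pool" "usercpuuc" {
  name                  = "usercpuuc"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aaw_prod.id
  vm_size               = "Standard_D64as_v5"

  # Pods land here only if they carry a matching toleration; nothing
  # selects on the pool name itself, so a rename should be safe as long
  # as the taints (and any labels used in node selectors) stay the same.
  node_taints = ["dedicated=usercpu:NoSchedule"]
}
```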
For:
- usercpu72uc - Standard_F72s_v2 -> Standard_D64as_v5, rename to usercpuuc?
- usercpu72pb - Standard_F72s_v2 -> Standard_D64as_v5, rename to usercpupb?
The Census Chatbot team are using the F72 nodes to host a large notebook that doesn't require too much memory (70 CPU / 128 GB mem). If I push them to the D64 machine, chances are nothing else will be scheduled on that node, which leaves ~120 GB of memory idle.
I'm thinking we just go ahead and create a new node pool. Thoughts?
A new node pool is probably the easiest path forward right now; we can always clean up the F72 nodes or update them later.
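If we go the new-pool route, the addition could look something like this (again just a sketch; names and taint values are assumptions):

```hcl
# Hypothetical: add a D64 pool alongside the existing F72 pools instead
# of resizing them in place. The usercpu72* pools can be drained and
# removed once workloads have moved over.
resource "azurerm_kubernetes_cluster_node_pool" "usercpupb" {
  name                  = "usercpupb"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aaw_prod.id
  vm_size               = "Standard_D64as_v5"

  # Reuse the same taint as the F72 pools so existing tolerations keep
  # working without any manifest changes.
  node_taints = ["dedicated=usercpu:NoSchedule"]
}
```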