Update AAW prod nodepools vm sizing
Let's downsize useruc, as we don't have as many users as we used to and most of the pool's resources are sitting idle.
The usercpu72 nodes use an old F-series SKU that no longer provides any advantage; let's put the D64as_v5 in that spot for larger workloads. An F64xx_v6 may be on the way that would be a better fit, but it isn't available in Canada Central right now.
- useruc - Standard_D64as_v5 -> Standard_D16as_v5
- userpb - Standard_D16s_v3 -> Standard_D16as_v5
- usercpu72uc - Standard_F72s_v2 -> Standard_D64as_v5, rename to usercpuuc?
- usercpu72pb - Standard_F72s_v2 -> Standard_D64as_v5, rename to usercpupb?
Lower priority
- storage - remove/unused?
- monitoring - remove/unused?
These changes can be made in the aaw-prod Terraform here: https://gitlab.k8s.cloud.statcan.ca/cloudnative/aaw/terraform-advanced-analytics-workspaces-infrastructure/-/blob/main/prod_cc_00.tf?ref_type=heads
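For illustration, here is a minimal sketch of what one of these resizes might look like with the azurerm provider. The resource and cluster names are assumptions; the real pool definitions live in the modules linked above.

```hcl
# Hypothetical sketch only: the resource and cluster names here are
# assumptions, the real pool definitions live in the AAW Terraform modules.
resource "azurerm_kubernetes_cluster_node_pool" "useruc" {
  name                  = "useruc"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aaw_prod.id

  # Downsize: most of the D64 capacity was sitting idle.
  vm_size = "Standard_D16as_v5" # was Standard_D64as_v5
}
```

Note that changing vm_size on an existing pool forces Terraform to replace it, so the nodes get drained and recreated; hence the after-hours window below.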
To be deployed after work hours on Wed Sept 25 or Oct 2, since it's a deployment to prod that will affect nodes.
Changes are being made in:
https://gitlab.k8s.cloud.statcan.ca/cloudnative/aaw/modules/terraform-azure-statcan-aaw-environment/-/merge_requests/51
Pending review and the right timeframe for the merge.
- check the logic for creating new notebooks, which possibly uses the nodepool name (this might just be done with taints/tolerations)
- check with the Census coding team, who are also using the F72 nodes, that they're good with the changes
- check if there's anything particular running on the monitoring/storage nodepools
a. check argocd manifests (or other repos) for tolerated deployments
b. inspect nodes on cluster for particular pods with tolerations
c. match on use (probably the best bet)
No examples found using the monitoring nodepool
Storage nodepool found at https://github.com/StatCan/terraform-kubernetes-aks-daaas-private/blob/master/aks.tf#L133
This repo is archived
No use of nodepool names was found on GitHub. Safe to assume the relationships are defined only by their taints.
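For reference, a rough sketch of what that taint-based pairing typically looks like on the node pool side; the taint key and value here are made up, not taken from the AAW modules.

```hcl
# Hypothetical sketch: the actual taint keys/values in the AAW modules
# may differ.
resource "azurerm_kubernetes_cluster_node_pool" "usercpuuc" {
  name                  = "usercpuuc"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aaw_prod.id
  vm_size               = "Standard_D64as_v5"

  # Pods land here only if they carry a matching toleration; nothing
  # selects on the pool name itself, so a rename should be safe as long
  # as the taints (and any labels used in node selectors) stay the same.
  node_taints = ["dedicated=usercpu:NoSchedule"]
}
```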
For:
- usercpu72uc - Standard_F72s_v2 -> Standard_D64as_v5, rename to usercpuuc?
- usercpu72pb - Standard_F72s_v2 -> Standard_D64as_v5, rename to usercpupb?
The Census Chatbot team are using the F72 nodes to host a large notebook that doesn't require too much memory (70 CPU / 128 GB mem). If I push them to the D64 machine, chances are nothing else will be scheduled on that node, which leaves ~120 GB of memory idle.
I'm thinking we just go ahead and create a new node pool. Thoughts?
A new node pool is probably the easiest path forward right now; we can always clean up the F72 nodes or update them later.
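If we go the new-pool route, the addition could look something like this (again just a sketch; names and taint values are assumptions):

```hcl
# Hypothetical: add a D64 pool alongside the existing F72 pools instead
# of resizing them in place. The usercpu72* pools can be drained and
# removed once workloads have moved over.
resource "azurerm_kubernetes_cluster_node_pool" "usercpupb" {
  name                  = "usercpupb"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aaw_prod.id
  vm_size               = "Standard_D64as_v5"

  # Reuse the same taint as the F72 pools so existing tolerations keep
  # working without any manifest changes.
  node_taints = ["dedicated=usercpu:NoSchedule"]
}
```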