gardener/etcd-druid

Improve Node Utilization by Reducing Requests for Druid-Managed Pods

Closed this issue · 18 comments

How to categorize this issue?

/kind enhancement

Stories

  • As operator, I want to run my ETCDs efficiently in terms of spent resources, but the usage of the requested CPU and memory is not optimal

What would you like to be added

  • Reduce (H)VPA minAllowed for memory (CPU has none, so nothing to change) as we see landscapes with a very high churn rate and many thousand very small clusters with Gardener (makes it extra easy to get to a new cluster within minutes only to keep them for a day of development or even only some minutes to run a test suite):
    • For etcd-main from 700M down to 300M
    • For etcd-events from 200M down to 60M
  • Change the requests for the static side car containers:
    • For etcd-main/backup-restore from 23m and 128Mi to 20m and 80Mi
    • For etcd-events/backup-restore from 23m and 128Mi to 10m and 40Mi (or drop it from etcd-events, i.e. integrate the validation into the main container, but that seemed difficult in the past without rebuilding the image)

Motivation (Why is this needed?)

We see that requested CPU and memory is not optimally used:

  • etcd-main uses ~30% of its requested CPU and ~20% of its requested memory (atypical, normally CPU is less well used)
  • etcd-events uses ~10% of its requested CPU and ~5% of its requested memory (atypical, normally CPU is less well used)

@shreyas-s-rao commented out-of-band on a Slack (I contacted him before I created this BLI proposal afterwards):

Isn’t HVPA being dropped for etcd?

Yes, the plan is to generally drop it, but when? Until then, we could run more efficiently.

Bringing minAllowed down for the etcds is ok, but only as long as it doesn’t cause too many restarts when the etcd pods start up.

Yes, I know that this is one reason why we have it, but the inefficiencies are high. I was checking out the usage and my recommendations are based on them.

Having said that, when the pods need to grow vertically through multiple more OOMs, it will take longer, but then again, that's the same also for GKE and others. Furthermore, even though we have bin-packing, there is often more than enough head room on a node (after all, we have no pod limits/guaranteed qos, so the node is the limit) and OOMs won't happen. Especially not with the safe-to-evict annotations, but also without I am doubtful that will be an issue.

Hibernated shoots won't be affected either.

The backup-restore container for does more than backup and restore at the moment - it does data validation as well.

Yes, I know that there is the initial hand-shake and validation, but my thinking is:

  • If it's really important and the side car requires actual resources, then why is it not under VPA? 23m and 128Mi are chosen how? These numbers for CPU and memory is required and sufficient how (also for large clusters)?
  • If ETCD events fails (data corruption), no restore can be done and hence the handshake and sidecar isn't even needed as it won't do anything that the druid cannot do from the outside. So, the druid, that we now have and didn't had before, could easily detect that state and orchestrate the needful and drop the PV and get ETCD events up from a clean slate.
  • Even if, we do not have limits and I don't think, 23m and 128Mi are magic numbers that will hold true in all cases (we do need a lot less and will probably need a lot more for larger clusters), but the validation load happens only at startup and not continuously. These are tiny amounts, CPU is even compressible, and nodes will have the headroom for that.
  • ETCD main might undergo restoration, but the same argument applies. The numbers are static and the node should and will easily have the necessary headroom practically always.

It makes sense from a cost perspective to actually run all operations within the etcd container itself, so that the same resources that are used for DB validation and restoration can then be continued to be used for the etcd process as well, thereby saving costs on the backup-restore container.

Yes, long-term that would be great. We didn't go there, becaue of build-time reasons only, if I remember correctly. However, these are larger changes for later, I guess.

So coming back to your question, it isn’t currently easy to reduce requests for the backup sidecar.

I wouldn't agree - do we have tickets indicating side-car container issues? We are just careful and would like "to avoid touching a running system", but are these numbers not waste and how are they sufficient for large clusters then?

If it needs the resources, it needs the resources.

But that's not how it's set up. It's all static - is validation and restoration really static at 23m and 128Mi? And as I said, the buffer is the node.

And currently we have a problem that if a full DB validation or restoration is required, then the backup sidecar needs more resources, so it would frequently restart as the VPA recommendation slowly increases for it, and finally stabilizes at some point when restoration finally succeeds.

So, you are saying that 23m and 128Mi are sufficient for large ETCDs? How did you pick these numbers?

Furthermore, isn't it the case that when and while the validation happens, ETCD itself is on hold (when the pod started, it hasn't even consumed any relevant resources yet). That means, that the cgroup isn't burdened by the usage of the main container. The main container is always idle in this phase, so CPU is plentiful/must be available, because the main container got it and isn't using it in validation at all. The side car can have 0 CPU and will still get enough of it, because it's in the very same cgroup also the main container is that must have everything that is required for normal operation anyway (but that idles). As for memory, it depends on when the validation runs. As far as I remember, the validation runs at the start of the pod and then never again, but at the start of the pod, the main ETCD cannot have consumed relevant memory resources yet, so the same argument will probably also hold for memory?

This consideration is about the events ETCD, but I am also wondering about the main ETCD. If that holds true also there, the main ETCD side car needs only the resources for the backup process as the restoration is similar to the validation phase. Nothing needs to be considered (even 0 CPU and 0 memory would be fine as the main container has the allotted resources and isn't using them yet obviously if a restoration is necessary). Of course, it shouldn't be 0 CPU and 0 memory because of the parallel backup process, but that's really all that needs to be calculated/considered; no validation and no restoration has to be considered, because at these times, the main container has not and is not using any relevant resources yet. WDYT?

The side car can have 0 CPU and will still get enough of it, because it's in the very same cgroup also the main container is that must have everything that is required for normal operation anyway

All containers within a pod will be under the same cgroup hierarchy but each container will get its own cgroup (k8s already supports cgroups v2) via cgroup namespace. But since there are no limits defined at the container level in the etcd pod, the etcdBR container will be able to use more resources than requested and as you stated above, if the main etcd container is not using requested resources, those will then be made available to the etcdBR container temporarily if the resources required are more than requested at the container level.

Meeting minutes from an out-of-band discussion between myself, @vlerenc, @unmarshall , @ishan16696 , @renormalize , @seshachalam-yv, @ashwani2k :

  • It is ok to reduce the VPA minAllowed for etcd container for both etcd-main and etcd-events, as recommended by @vlerenc , but to 500Mi and 100Mi respectively (to be on the safer side for etcd-main)
  • For backup-restore container, it would be safe to reduce to 5m and 5Mi for etcd-events, as recommended by @vlerenc , since the etcd process runs only in one of the two containers at any given point of time, and the requests set aside for the etcd container can be "re-used" by the backup-restore container during initialization, so the total requests for the backup-restore container itself can be minimal, which are mainly used for the general functioning of the backup sidecar (snapshotting when backups are enabled, and also maintenance (defragmentation)).

Action items:

  • @vlerenc to provide per-container usage stats for etcd pods
  • Do an analysis for backup-restore container for etcd-main (backups enabled), to check how memory is used, especially during initialization (DB validation + possible restoration).

Here are some sample stats on the 4 ETCD containers:

test_landscape_foo:
  etcd-main:
    etcd:
      metrics:
        allocatable:
          cpu: '8421'
          memory: 31.1T
        efficiency:
          used_requested_cpu: 27 %
          used_requested_memory: 39 %
        requests:
          allocatable_cpu: '674'
          allocatable_max_cpu: 1836m
          allocatable_max_memory: 28G
          allocatable_memory: 3.1T
          containers: 2592
          pods: 2592
        usage:
          allocatable_cpu: '182'
          allocatable_max_cpu: 1474m
          allocatable_max_memory: 5.8G
          allocatable_memory: 1.2T
          allocatable_upper_percentile_cpu: 148m
          allocatable_upper_percentile_memory: 918.6M
          samples: 2588
      pod_issues:
        ContainerRestarted: 10
        Error: 4
        OOMKilled: 1
        Unhealthy: 89
        Unknown: 5
    backup-restore:
      metrics:
        allocatable:
          cpu: '8421'
          memory: 31.1T
        efficiency:
          used_requested_cpu: 51 %
          used_requested_memory: 49 %
        requests:
          allocatable_cpu: '60'
          allocatable_max_cpu: 23m
          allocatable_max_memory: 128M
          allocatable_memory: 324G
          containers: 2592
          pods: 2592
        usage:
          allocatable_cpu: '31'
          allocatable_max_cpu: 104m
          allocatable_max_memory: 541.7M
          allocatable_memory: 158.3G
          allocatable_upper_percentile_cpu: 21m
          allocatable_upper_percentile_memory: 103.2M
          samples: 2588
      pod_issues:
        ContainerRestarted: 22
        Error: 13
        Unknown: 4
  etcd-events:
    etcd:
      metrics:
        allocatable:
          cpu: '8421'
          memory: 31.1T
        efficiency:
          used_requested_cpu: 14 %
          used_requested_memory: 26 %
        requests:
          allocatable_cpu: '309'
          allocatable_max_cpu: 300m
          allocatable_max_memory: 2.1G
          allocatable_memory: 1.3T
          containers: 2592
          pods: 2592
        usage:
          allocatable_cpu: '43'
          allocatable_max_cpu: 199m
          allocatable_max_memory: 2.5G
          allocatable_memory: 330.7G
          allocatable_upper_percentile_cpu: 38m
          allocatable_upper_percentile_memory: 289.8M
          samples: 2588
      pod_issues:
        ContainerRestarted: 17
        Unhealthy: 69
        Unknown: 17
    backup-restore:
      metrics:
        allocatable:
          cpu: '8421'
          memory: 31.1T
        efficiency:
          used_requested_cpu: 35 %
          used_requested_memory: 26 %
        requests:
          allocatable_cpu: '60'
          allocatable_max_cpu: 23m
          allocatable_max_memory: 128M
          allocatable_memory: 324G
          containers: 2592
          pods: 2592
        usage:
          allocatable_cpu: '21'
          allocatable_max_cpu: 23m
          allocatable_max_memory: 101.8M
          allocatable_memory: 84.7G
          allocatable_upper_percentile_cpu: 13m
          allocatable_upper_percentile_memory: 40.1M
          samples: 2588
      pod_issues:
        ContainerRestarted: 22
        Error: 5
        Unknown: 17
test_landscape_bar:
  etcd-main:
    etcd:
      metrics:
        efficiency:
          used_requested_cpu: 21 %
          used_requested_memory: 40 %
        requests:
          allocatable_cpu: '1945'
          allocatable_max_cpu: 4000m
          allocatable_max_memory: 30G
          allocatable_memory: 8.3T
          containers: 6684
          pods: 6684
        usage:
          allocatable_cpu: '415'
          allocatable_max_cpu: 1256m
          allocatable_max_memory: 4.4G
          allocatable_memory: 3.3T
          allocatable_upper_percentile_cpu: 112m
          allocatable_upper_percentile_memory: 945M
          samples: 6683
      pod_issues:
        ContainerRestarted: 26
        Error: 1
        OOMKilled: 5
        Unhealthy: 69
        Unknown: 15
    backup-restore:
      metrics:
        efficiency:
          used_requested_cpu: 37 %
          used_requested_memory: 41 %
        requests:
          allocatable_cpu: '154'
          allocatable_max_cpu: 23m
          allocatable_max_memory: 128M
          allocatable_memory: 835.5G
          containers: 6684
          pods: 6684
        usage:
          allocatable_cpu: '57'
          allocatable_max_cpu: 146m
          allocatable_max_memory: 828M
          allocatable_memory: 341.3G
          allocatable_upper_percentile_cpu: 15m
          allocatable_upper_percentile_memory: 81M
          samples: 6684
      pod_issues:
        ContainerRestarted: 33
        Error: 16
        Unknown: 13
  etcd-events:
    etcd:
      metrics:
        efficiency:
          used_requested_cpu: 40 %
          used_requested_memory: 55 %
        requests:
          allocatable_cpu: '307'
          allocatable_max_cpu: 300m
          allocatable_max_memory: 4.3G
          allocatable_memory: 1.8T
          containers: 6684
          pods: 6684
        usage:
          allocatable_cpu: '124'
          allocatable_max_cpu: 331m
          allocatable_max_memory: 2G
          allocatable_memory: 1022.4G
          allocatable_upper_percentile_cpu: 32m
          allocatable_upper_percentile_memory: 282.9M
          samples: 6684
      pod_issues:
        ContainerRestarted: 86
        Unhealthy: 38
        Unknown: 86
    backup-restore:
      metrics:
        efficiency:
          used_requested_cpu: 26 %
          used_requested_memory: 29 %
        requests:
          allocatable_cpu: '154'
          allocatable_max_cpu: 23m
          allocatable_max_memory: 128M
          allocatable_memory: 835.5G
          containers: 6684
          pods: 6684
        usage:
          allocatable_cpu: '41'
          allocatable_max_cpu: 19m
          allocatable_max_memory: 136M
          allocatable_memory: 240G
          allocatable_upper_percentile_cpu: 10m
          allocatable_upper_percentile_memory: 41.8M
          samples: 6684
      pod_issues:
        ContainerRestarted: 100
        Error: 14
        Unknown: 82

Unless we want to put the backup-restore sidecar under VPA, the 95th percentile for CPU and memory are ~20m and ~100M (above called upper_percentile). If you would go for the maximum, it would be almost 10x as much and far beyond the current requests.

In summary, we decided already to set smaller requests for the etcd-events sidecar container:

  • From 23m and 128Mi to 5m and 5Mi

And now we probably also can lower the requests for the etcd-main sidecar container slightly, if you feel comfortable with the 95th percentile:

  • From 23m and 128Mi to 20m and 100Mi

FYI: I updated the ticket main description based on the above numbers.

@shreyas-s-rao I hope we can get the first set of changes in quickly (the side cars only when we change the statefulset anyway)? Those are relatively straight-forward, therefore simple, with minimal risk (if at all), and are all data-driven. Whatever is more complicated can be done later, but let's get those "trivial" number changes in. That would be great.

Yes @vlerenc . Changes from this issue as well as #767 are to be made in g/g codebase, after we make the next druid release and update that in g/g (to avoid multiple rollouts of etcd). The next druid release will include the large PR #777 , hence it might take some more time for it to be reviewed and merged before we can make a druid release. Hope the delay is ok.

If it's still urgent, and since it technically doesn't depend on the new druid changes, and will help in immediate cost savings, we can make the changes immediately in g/g, which will go in the next g/g release. WDYT?

It depends on what the chances are for the review @shreyas-s-rao. :-) It's not urgent and we'd like to bundle our changes to roll less often. If the review process takes years (like with GEP-23), I'd rather get it in before that. If it's weeks, then don't rush it. If it's months, let's talk how many.

Haha sure. It's definitely just weeks, so we could possibly wait.

Hello,
Do we have an ETA on reducing ETCD requests? This task is overlapping with an undergoing less focused initiative to optimise overall resource configuration for the garden control plane. Ideally, the ETCD part should be covered here, so it can benefit from expert knowledge.

Hi @shreyas-s-rao, is there any update on the first 2 bullet points (dropped the last bullet points because they are tracked now elsewhere), i.e. right-sizing minAllowed and the initial requests?

@vlerenc as mentioned earlier, this change needs to be made on g/g, and ideally clubbed with an etcd rollout, like the upcoming druid:v0.23.0 release. I will make the change to the requests for backup sidecar when integrating druid v0.23.0 into g/g. Hope that's ok with you. If not, then we can still make the change immediately on g/g, which will cause a second rollout of the etcd pods.

Hi @shreyas-s-rao. Sure, we can club it together. I am only after getting it done - nothing has to be done immediately, but eventually would be nice (there are so many small items everywhere and this one here is from April).

BTW, you said requests for the side-cars (second part), but the first part is about minAllowed for the main containers. I am fine if you do them together with the requests changes, but a.) you didn't mention the minAllowed changes in your reply and b.) it will not roll anything. Just for transparency (in this ticket, so that I know when to look again), do you intend to do them immediately or together with the other changes?

@shreyas-s-rao @ashwani2k @unmarshall Will this come with 0.23? I see the values are still unchanged and 0.23 is already in.

Not changing the values as recommended in this ticket skews the data and the actual requests on all our landscapes. I see in the data, that we have an accumulation at 700M resp. 200M and we waste through that and the initial requests money.

Worse, it skews the CPU to memory ratio and blocks machine type optimizations.

I know I said that it's not critical and it wasn't, but considering that this ticket is now open since almost half a year, please make it a priority to change the value now. Not in one month, but right now.

We have hundreds of components that can be optimised, but because ETCD is a large one - with a dedicated pool even - I opened a ticket just for this component for you with new values extracted from data. It actually matters how much resources it consumes and that the actual CPU to memory requests are not skewed by wrong minAllowed and overall the usage drops to reasonable values by right minAllowed and initial requests. It's time, if you haven't done it already. And if you have, apologies.

Hey @vlerenc . The plan is (and always was) to change these values in the g/g integration PR which upgrades druid to v0.23.0. Apologies if I previously meant that the changes you recommended would be done to the default values in druid itself.

Since g/g is a consumer of druid, and the consumption is not simply using a statefulset with specific initial resource requests, but also along with a VPA with minAllowed values, and the VPA resource deployed by gardenlet, it makes sense to make the recommended changes from this ticket in the gardenlet code that creates the Etcd and VPA resources, rather than defaults (ideally gardener (or any code for that matter) should not be relying on default values from upstream components).

Hope this clears the air. Apologies for any previous miscommunication on my part.

@shreyas-s-rao Yes, you explained so, but I do not see a PR - all I see is gardener/gardener#10506.

Can you please point me to it? I pulled the code prior to writing my comments and see e.g. in pkg/component/etcd/etcd/etcd.go still the old values and no PR that would change them. Am I looking wrong?

Yet to raise the PR on the repo. Running some final tests. But rest assured, the changes from this ticket will be part of the PR.

Thank you @shreyas-s-rao and apologies for not asking first out-of-band.