planetscale/vitess-operator

Autoscaling and policy-driven automations

christosnc opened this issue · 1 comment

Hello everyone, 😀

This is a proposal, and a place to discuss the implementation of autoscaling and policy-driven automations for Vitess.

The general idea is to be able to provide a list of policies/rules (possibly in the spec) so that certain events/actions take place automatically. This would be very useful for specifying custom autoscaling scenarios or alerts, for example.

The high-level approach to this could be:

  1. We create an "orchestrator" server that collects metrics from our Vitess clusters.
  2. We define some "policies" for when/how to scale up and/or down (based on metrics and limits), and specify how frequently each policy is checked.
  3. The server evaluates each policy at its interval and, if the policy fires, runs custom predefined actions against our Vitess clusters (a sketch of such a check loop follows this list).
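To make step 3 a bit more concrete, here is a minimal Go sketch of what such a check loop could look like. This is not an existing component; `Policy`, `fetchShardSizeGB`, and `triggerResplit` are hypothetical placeholders, not real Vitess or vitess-operator APIs.

```go
package main

import (
	"fmt"
	"time"
)

// Policy is a hypothetical rule: check a metric against a limit at a fixed
// interval and run an action when the limit is exceeded.
type Policy struct {
	Name     string
	Interval time.Duration
	// Check returns the current metric value and whether the limit is exceeded.
	Check func() (value float64, exceeded bool)
	// Act performs the configured action (scale, back up, alert, run a script, ...).
	Act func() error
}

// run evaluates one policy on its own ticker, forever.
func run(p Policy) {
	ticker := time.NewTicker(p.Interval)
	defer ticker.Stop()
	for range ticker.C {
		if value, exceeded := p.Check(); exceeded {
			fmt.Printf("policy %q triggered (value=%.2f), running action\n", p.Name, value)
			if err := p.Act(); err != nil {
				fmt.Printf("policy %q action failed: %v\n", p.Name, err)
			}
		}
	}
}

func main() {
	shardSize := Policy{
		Name:     "shard-size",
		Interval: time.Hour,
		Check:    func() (float64, bool) { v := fetchShardSizeGB(); return v, v > 256 }, // hypothetical metric source
		Act:      func() error { return triggerResplit() },                              // hypothetical action
	}
	go run(shardSize)
	select {} // block forever; a real server would handle shutdown signals
}

// fetchShardSizeGB and triggerResplit stand in for real metric collection and
// cluster actions; they are placeholders only.
func fetchShardSizeGB() float64 { return 0 }
func triggerResplit() error     { return nil }
```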

To achieve this, we need to be able to specify the following information in the spec for any policy (a rough sketch of possible API types follows the list):

  • Metrics and limits (e.g. shard size > 256 GB, average CPU load > 60%)
    This allows us to specify when our automations will be executed. It involves deciding which metrics are useful, as well as finding a reliable and accurate way to obtain them.

  • Set of actions to run when the event is triggered (e.g. execute a script, send an alert, perform a backup)
    This allows us to specify what our automations will do when executed. "Execute script" is really the only necessary action, since it allows for custom-made workflows and automations.

  • Interval / frequency of checking (e.g. every hour)
    If no metric/limit is specified, the automation simply runs at the given interval (useful for backups and reports).
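For illustration, the spec entries above might translate into Go API types roughly like the following. This is only a sketch of one possible shape; none of these types or fields exist in vitess-operator today, and all names are made up.

```go
package automation

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// AutomationPolicy is a hypothetical shape for one policy entry in the
// VitessCluster spec.
type AutomationPolicy struct {
	// Name identifies the policy in status and logs.
	Name string `json:"name"`
	// Interval is how often the policy is evaluated (e.g. "1h").
	Interval metav1.Duration `json:"interval"`
	// Trigger is optional; if omitted, the actions run on every interval
	// (useful for backups and reports).
	Trigger *MetricTrigger `json:"trigger,omitempty"`
	// Actions run in order when the policy fires.
	Actions []Action `json:"actions"`
}

// MetricTrigger compares a metric against a limit,
// e.g. shard size > 256Gi or average CPU load > 60%.
type MetricTrigger struct {
	Metric   string `json:"metric"`   // e.g. "shardSizeBytes", "avgCPUPercent"
	Operator string `json:"operator"` // e.g. ">", "<"
	Limit    string `json:"limit"`    // e.g. "256Gi", "60"
}

// Action is one step to take when the policy fires.
type Action struct {
	// Script runs an arbitrary command, which covers custom workflows.
	Script *ScriptAction `json:"script,omitempty"`
	// Backup and Alert could be convenience shorthands for common cases.
	Backup *BackupAction `json:"backup,omitempty"`
	Alert  *AlertAction  `json:"alert,omitempty"`
}

type ScriptAction struct {
	Command []string `json:"command"`
}

type BackupAction struct{}

type AlertAction struct {
	Message string `json:"message"`
}
```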

All this could be tremendously useful, allowing for custom autoscaling (horizontal and vertical), alerts, reports, integrations, and automated backups.

Please give your thoughts and ideas!

This is a follow-up to a Slack discussion. Please check it out for more info.

To what extent can this be done using Kubernetes primitives already? For example, while we haven't done it yet, we have thought about autoscaling vtgate on the number of concurrent queries it is handling (we find this more predictive of load, and more stable for what is mostly a proxy, than CPU usage).

For Kubernetes Deployments, this can be done using the Horizontal Pod Autoscaler or KEDA. What would it take to expose the necessary knobs to something like KEDA? And how could such a setup be made more accessible (examples? documentation?)
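As a concrete starting point, here is a hedged sketch of what an HPA on a custom vtgate metric could look like, built with the standard k8s.io/api/autoscaling/v2 types. It assumes vtgate pods expose a concurrent-queries metric through a custom metrics adapter (e.g. prometheus-adapter); the Deployment name and metric name are placeholders, not anything the operator provides today.

```go
package autoscale

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// vtgateHPA builds an HPA that scales a vtgate Deployment on a per-pod custom
// metric. The metric must be visible to the HPA via a custom metrics adapter.
func vtgateHPA() *autoscalingv2.HorizontalPodAutoscaler {
	minReplicas := int32(2)
	return &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "vtgate", Namespace: "default"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "example-zone1-vtgate", // placeholder for the operator-managed vtgate Deployment
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 10,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.PodsMetricSourceType,
				Pods: &autoscalingv2.PodsMetricSource{
					Metric: autoscalingv2.MetricIdentifier{
						Name: "vtgate_concurrent_queries", // hypothetical metric name
					},
					Target: autoscalingv2.MetricTarget{
						// Scale so each vtgate pod handles ~500 concurrent queries on average.
						Type:         autoscalingv2.AverageValueMetricType,
						AverageValue: resource.NewQuantity(500, resource.DecimalSI),
					},
				},
			}},
		},
	}
}
```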