pulumi/pulumi-kubernetes-operator

Improve architecture for horizontal scaling

spender0 opened this issue · 6 comments

Hello!

  • Vote on this issue by adding a 👍 reaction
  • If you want to implement this feature, comment to let us know (we'll work with you on design, scheduling, etc.)

Issue details

Hello Pulumi team! I've been using Pulumi for years and recently started using Pulumi Kubernetes Operator.
With 40+ stacks based on the same TypeScript npm project, all managed by a single Pulumi operator installation, I've found some design problems in the operator.

When the operator runs npm install and the Pulumi program for several stacks, it consumes a lot of CPU and memory, but only after git changes. Most of the time, when there are no git changes, the operator pod is doing nothing. Yet I still need to set CPU and memory requests high enough to avoid an OOM kill, so the pod's resources are underutilized and it is burning money most of the time.

[Screenshot attached: 2023-11-03, 12:50 PM]

The problem is partly related to #368.
When I set low resource limits, the operator got OOMKilled during infra provisioning and the stack state file was left locked by the interrupted update.

In addition to the resource problem, it is not possible to scale the operator deployment horizontally to speed up syncing a large number of stacks. Only one pod can work on stacks at a time; that is what the Kubernetes lease lock enforces.

As a solution, I would decouple the "npm install" and "pulumi up" functionality from the operator pod into worker pods. The operator could assign one worker pod per stack to provision it, and once the stack is done the worker would exit to save costs. The operator pod would then be just a controller for Stack resources and worker pods. This would make the Pulumi Operator scalable enough for large platforms with hundreds or thousands of stacks.
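The proposed split can be modeled in a few lines. This is a minimal sketch, not operator code: the `controller` and `worker` names are hypothetical, and threads stand in for the short-lived Kubernetes Job pods the real design would spawn.

```python
import concurrent.futures

def worker(stack: str) -> str:
    # Stand-in for the expensive per-stack work: `npm install` + `pulumi up`.
    # In the proposed design this runs in its own pod, which exits when done.
    return f"{stack}: succeeded"

def controller(changed_stacks: list[str], max_workers: int = 10) -> dict[str, str]:
    # The controller itself stays lightweight: it only watches for changed
    # stacks and dispatches each one to a worker, collecting results.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, s): s for s in changed_stacks}
        return {futures[f]: f.result() for f in concurrent.futures.as_completed(futures)}

results = controller([f"stack-{i}" for i in range(40)])
```

Because the heavy lifting happens in workers that exist only for the duration of one update, the controller's own resource requests can stay small regardless of how many stacks change at once.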

I would be glad to provide additional information, just let me know.

Affected area/feature

cc @rquitales @EronWright for your awareness.

Note that this is similar to (or potentially ultimately the same as) what’s discussed in #78 (run the deployments as Jobs)

#434 is another, even more extreme, option for separating the deployments from the operator's compute (running them in Pulumi Deployments instead of directly inside the cluster).

How is concurrency limited when handling stacks all changing at the same time? If OP has 40+ stacks and they're all being refreshed/updated at the same time, would some simple concurrency controls smooth the spike out over a longer time?

I set the MAX_CONCURRENT_RECONCILES variable in the operator pod to 4. If I set a higher value, e.g. 10, the operator consumes far more resources and gets OOM-killed unless I dedicate even more memory to the pod. That leads to money burning, since most of the time the pod is doing nothing because there are no changes in the stacks.

If I leave MAX_CONCURRENT_RECONCILES=4, the update is too slow when all stacks receive a change at once.
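The tradeoff described above can be put in back-of-envelope terms. The numbers below (500 MB and 5 minutes per reconcile) are hypothetical, chosen only to illustrate how peak memory and total wall-clock time pull in opposite directions as concurrency changes.

```python
import math

def peak_memory_mb(concurrency: int, mem_per_run: int = 500) -> int:
    # Peak memory grows linearly with the number of concurrent reconciles,
    # since each in-flight `npm install` + `pulumi up` holds its own footprint.
    return concurrency * mem_per_run

def makespan_minutes(stacks: int, concurrency: int, minutes_per_run: int = 5) -> int:
    # Total wall-clock time to get through all stacks when every stack
    # changed at once: batches of `concurrency` runs, one after another.
    return math.ceil(stacks / concurrency) * minutes_per_run

# With 40 stacks: 4 concurrent reconciles keep memory low but take 50 minutes;
# 10 concurrent reconciles finish in 20 minutes but need 2.5x the memory.
low = (peak_memory_mb(4), makespan_minutes(40, 4))    # (2000 MB, 50 min)
high = (peak_memory_mb(10), makespan_minutes(40, 10))  # (5000 MB, 20 min)
```

Either way the pod must be sized for the peak, which is exactly why on-demand worker pods are attractive: the peak moves out of the long-lived operator pod.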

Good to know for someone new to the operator; I was just thinking out loud about the concurrency, but it makes sense that the update is too slow as well. Given those requirements, it does feel like pushing those sessions out to Job pods, so they can be distributed on demand across the wider cluster, makes sense.

Added to epic #586