pulumi/pulumi-kubernetes-operator

Runaway Memory Consumption when `ContinueResyncOnCommitMatch` set to True

Opened this issue · 8 comments

What happened?

Runaway memory consumption when setting the ContinueResyncOnCommitMatch annotation to true. This behavior does not occur when setting the annotation to false. This consumption results in recurring pod evictions.

Steps to reproduce

Set the continueResyncOnCommitMatch annotation to true and watch pod memory usage. Eventually, pods are evicted after exhausting their memory limit.

Expected Behavior

Setting the continueResyncOnCommitMatch annotation should not impact memory consumption.

Actual Behavior

Setting the continueResyncOnCommitMatch annotation is causing memory exhaustion.

Output of pulumi about

No response

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

Assigned to @squaremo for now. I have also asked for some more telemetry information on the pod(s) displaying this behavior.

One thing to note is that the operator execs child processes and runs as PID 1, but doesn't reap defunct processes. So you end up with:

$ ps
  PID TTY          TIME CMD
    1 ?        00:00:09 pulumi-kubernet
   11 ?        00:00:00 ssh-agent
  118 ?        00:00:00 pulumi-language <defunct>
  126 ?        00:00:01 pulumi-resource <defunct>
  159 ?        00:00:00 pulumi-resource <defunct>
  317 ?        00:00:00 pulumi-language <defunct>
  326 ?        00:00:01 pulumi-resource <defunct>
  358 ?        00:00:00 pulumi-resource <defunct>
  499 ?        00:00:00 pulumi-language <defunct>
  507 ?        00:00:01 pulumi-resource <defunct>
  539 ?        00:00:00 pulumi-resource <defunct>
  679 ?        00:00:00 pulumi-language <defunct>
  687 ?        00:00:01 pulumi-resource <defunct>
  719 ?        00:00:00 pulumi-resource <defunct>
  865 ?        00:00:00 pulumi-language <defunct>
  873 ?        00:00:01 pulumi-resource <defunct>
  905 ?        00:00:00 pulumi-resource <defunct>
  955 ?        00:00:00 ps

The usual way to fix this is to use a stand-in init process as PID 1 -- e.g., tini.
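
An in-process alternative, for reference: a program running as PID 1 can reap zombie children itself by handling SIGCHLD. Here's a minimal Go sketch of that pattern (illustrative only, not the operator's actual code):

```go
// Minimal PID-1 zombie reaper sketch (Linux-only; illustrative, not the
// operator's code). It waits on SIGCHLD and reaps any exited children so
// they don't accumulate as <defunct> entries.
package main

import (
	"os"
	"os/signal"
	"syscall"
)

func reapZombies() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)
	for range sigs {
		for {
			var status syscall.WaitStatus
			// -1 means "any child"; WNOHANG returns immediately when there
			// is nothing left to reap.
			pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
		}
	}
}

func main() {
	go reapZombies()
	// ... the process's real work would go here ...
	select {}
}
```

Using tini as the container entrypoint achieves the same effect without any code changes, which is why it's the usual fix.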

@squaremo So I finally got around to reproducing this issue. I confirmed that setting continueResyncOnCommitMatch to true causes memory consumption to increase gradually over time, as the graph below shows. The relatively flat portions of the graph are due to expired AWS credentials, so please ignore that part.

This repro is based on https://github.com/phillipedwards/kubernetes-operator-otel/, which is an adapted version of the aws-s3 example in the pulumi-kubernetes-operator repo. The main difference is that my repro automatically pipes metrics to CloudWatch (Container Insights), so you can visualize memory usage as a time series.

Now that we have a repro, please let me know if you want me to grab any data from the cluster/CloudWatch, as I still have the cluster up and running.

[attached image: CloudWatch Container Insights graph of operator pod memory utilization over time]

@phillipedwards Nice one!

It looks like each tick on the X axis is three hours, so I read this as: memory use roughly doubles from 5% to just about 10% in the first 9 hours, then (after a pause) increases by about 1.5% over the next 9 hours (and it looks like it might be doing the same towards the end of the window shown).
I'm curious about those flurries of memory use every ~12 hours even while the AWS creds were invalid -- any idea what was happening there?

Is there any chance you can give me access? If not, I would love to see the process count over the same period, if that's available. As I said above, the released versions of the operator leak processes, which is not a drastic drain on memory, but might explain changes of O(1%) over several hours.

@squaremo After installing the RC that fixes the zombie processes, they are no longer occurring. Unfortunately, the memory leak still appears to be present. Any ideas on how I can help triage this issue and get it moving in the right direction?

> Any ideas on how I can help triage this issue and get it moving in the right direction?

I think a pprof endpoint would be a good start. I can add it, behind a flag, to the operator. That will give us the opportunity to look at fine-grained memory use of the operator process.
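
For context, the standard way to do this in Go is to blank-import net/http/pprof and serve it on an opt-in address. A minimal sketch follows; the flag name and address are illustrative, not necessarily what the operator will end up using:

```go
// Illustrative sketch of an opt-in pprof endpoint; the flag name is made up.
package main

import (
	"flag"
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	pprofAddr := flag.String("pprof-listen", "", "address to serve pprof on, e.g. 127.0.0.1:6060; disabled if empty")
	flag.Parse()

	if *pprofAddr != "" {
		go func() {
			// Exposes heap, goroutine, and allocation profiles under /debug/pprof/.
			log.Println(http.ListenAndServe(*pprofAddr, nil))
		}()
	}

	// ... operator startup would continue here ...
	select {}
}
```

With that in place (plus a kubectl port-forward to the pod), heap profiles taken at intervals with `go tool pprof http://127.0.0.1:6060/debug/pprof/heap` should show which allocations account for the growth.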

> This behavior does not occur when setting the annotation to false.

Do we have evidence to support this? I would expect leaks to be more obvious when ContinueResyncOnCommitMatch is set, simply because many more syncs will be undertaken; but that is not the same thing as it causing leaks.

While a leak is a concern and I'd like to track this down: in practice, setting a resource request for a generous amount of RAM (8GB, something like that) will keep this from causing problems.

#381 (the Go build cache growing quickly and causing pod eviction) is a bigger concern, since it is (1) sensitive to the inputs (e.g., how many files you have in your repo) as well as to how many builds you do; and (2) very easy to trip over, because the default config mounts ephemeral storage at /tmp and puts the Go build cache there.