pulumi/pulumi-kubernetes-operator

Restart of pulumi operator leader causes unreleased stack locks

YuriiVarvynets opened this issue · 11 comments

What happened?

If the Pulumi operator pod is restarted for any reason (for example, during the operator upgrade process), it does not unlock a stack that was being processed at the time of the restart.

error: the stack is currently locked by 1 lock(s). Either wait for the other process(es) to end or delete the lock file with `pulumi cancel`.
  azblob://state/.pulumi/locks/pulumi-operator.dev.global/cb307d24-6243-4d1e-9470-165b0aac381e.json: created by pulumi-kubernetes-operator@pulumi-kubernetes-operator-6677a05e-5955496654-8x2mh (pid 30553) at 2022-11-04T22:06:59Z

The pod pulumi-kubernetes-operator-6677a05e-5955496654-8x2mh no longer exists.

Configuration:

  1. Custom state backend on Azure.
  2. Operator configuration is the same as in the deploy-operator-py example.
  3. Stack configuration:
import pulumi_kubernetes as k8s


def createStack(name, metadata, stackProjectRepo, repoDir, envRefs):
    # Creates a pulumi.com/v1 Stack resource that uses the Azure blob state backend.
    return k8s.apiextensions.CustomResource(
        resource_name=name,
        api_version="pulumi.com/v1",
        kind="Stack",
        metadata=metadata,
        spec={
            "stack": name,
            "backend": "azblob://state",
            "projectRepo": stackProjectRepo,
            "repoDir": repoDir,
            "gitAuth": {
                "accessToken": {
                    "type": "Secret",
                    "secret": {
                        "name": "git-tkn",
                        "key": "accessToken",
                    },
                }
            },
            "branch": "refs/heads/master",
            "refresh": True,
            "continueResyncOnCommitMatch": True,
            "resyncFrequencySeconds": 120,
            "envRefs": envRefs,
            "destroyOnFinalize": False,
        },
    )
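
The helper is invoked roughly like this; the stack names, repo URL, and secret references below are illustrative placeholders, not the real configuration:

for app in ["app-a", "app-b", "app-c"]:
    createStack(
        name=f"pulumi-operator.dev.{app}",
        metadata={"namespace": "pulumi"},
        stackProjectRepo="https://dev.azure.com/example/infra/_git/infra",  # placeholder repo
        repoDir=app,
        envRefs={
            # Credentials for the azblob:// state backend, referenced from a Secret (placeholder names).
            "AZURE_STORAGE_ACCOUNT": {
                "type": "Secret",
                "secret": {"name": "state-backend", "key": "account"},
            },
            "AZURE_STORAGE_KEY": {
                "type": "Secret",
                "secret": {"name": "state-backend", "key": "key"},
            },
        },
    )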

Steps to reproduce

  1. Deploy multiple Pulumi operator Stacks with:
            "refresh": True,
            "continueResyncOnCommitMatch": True,
            "resyncFrequencySeconds": 120,
  2. Restart the leader pod (or upgrade the operator version) while stacks are being processed and there is a lock in the backend (a scripted way to do this is sketched below).
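
A rough way to script step 2, assuming the kubernetes Python client is installed, the operator's leader-election Lease is pulumi-kubernetes-operator-lock in the pulumi namespace, and the holder identity has the form <pod-name>_<uuid> (as seen in the operator's leader-election events):

from kubernetes import client, config

config.load_kube_config()

# Find the current leader from the operator's leader-election Lease and delete
# that pod while a stack update is in flight.
lease = client.CoordinationV1Api().read_namespaced_lease(
    name="pulumi-kubernetes-operator-lock", namespace="pulumi"
)
leader_pod = lease.spec.holder_identity.split("_")[0]  # holder is "<pod>_<uuid>"
client.CoreV1Api().delete_namespaced_pod(name=leader_pod, namespace="pulumi")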

Expected Behavior

Locks are released before pod restart.

Actual Behavior

Locks are NOT released, and the next operator leader cannot perform any actions on the Stacks.

Output of pulumi about

No response

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

Thanks for opening the issue @YuriiVarvynets. We do use the lock manager in the controller to make sure only one controller is running at a time: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/cmd/manager/main.go#L123-L130.

Perhaps you will need to bump up the graceful shutdown timeout on your pods if your stack updates take a long time?
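
For illustration, bumping the grace period in a deploy-operator-py style program might look roughly like the sketch below; the image tag, labels, and the 30-minute value are placeholders, and a real Deployment will carry more settings (env vars, resources, etc.):

import pulumi_kubernetes as k8s

labels = {"name": "pulumi-kubernetes-operator"}

k8s.apps.v1.Deployment(
    "pulumi-kubernetes-operator",
    spec=k8s.apps.v1.DeploymentSpecArgs(
        replicas=1,
        selector=k8s.meta.v1.LabelSelectorArgs(match_labels=labels),
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(labels=labels),
            spec=k8s.core.v1.PodSpecArgs(
                service_account_name="pulumi-kubernetes-operator",
                # Give in-flight stack updates time to finish and release their
                # backend locks before the kubelet kills the pod (placeholder value).
                termination_grace_period_seconds=1800,
                containers=[
                    k8s.core.v1.ContainerArgs(
                        name="pulumi-kubernetes-operator",
                        image="pulumi/pulumi-kubernetes-operator:v1.10.1",  # placeholder tag
                    )
                ],
            ),
        ),
    ),
)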

Would you be able to dump some logs from the operator around the upgrade? You might also want to enable debug logging (--zap-level=debug) and provide the logs for the replacement pod.
cc @squaremo in case I am missing something obvious.

Thank you, @viveklak, for taking a look at this. I have enabled debug logging, but it takes time to reproduce the issue.

We do use the lock manager in the controller to make sure only one controller is running at a time: https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/cmd/manager/main.go#L123-L130.

How can I increase leaseDurationSeconds for this lock? The default of 15 seconds is too low for my use case.

# kubectl get leases.coordination.k8s.io pulumi-kubernetes-operator-lock -o yaml -n pulumi
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  ...
spec:
  leaseDurationSeconds: 15
  ...

Perhaps you will need to bump up the graceful shutdown timeout on your pods if your stack updates take a long time?

I have 5 stacks with basic S3 apps: https://github.com/pulumi/examples/tree/master/aws-py-s3-folder
Graceful shutdown is set to the default of 5 minutes, which should be enough for such a simple app to complete. However, if the stack processing time exceeds 5 minutes and the operator happens to restart during that window, the lock will not be released, am I right? What are the other possible options to avoid this situation?

From the log with debug log level:

E1114 11:27:21.821634       1 leaderelection.go:325] error retrieving resource lock pulumi/pulumi-kubernetes-operator-lock: Get "https://172.16.0.1:443/apis/coordination.k8s.io/v1/namespaces/pulumi/leases/pulumi-kubernetes-operator-lock": context deadline exceeded
I1114 11:27:21.821759       1 leaderelection.go:278] failed to renew lease pulumi/pulumi-kubernetes-operator-lock: timed out waiting for the condition
{"level":"debug","ts":"2022-11-14T11:27:21.821Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ConfigMap","namespace":"pulumi","name":"pulumi-kubernetes-operator-lock","uid":"3df0dd70-932c-4a6e-ab8d-3ba09c372966","apiVersion":"v1","resourceVersion":"327159499"},"reason":"LeaderElection","message":"pulumi-kubernetes-operator-d5cea0e0-69d96cd79b-wtrvm_5a56dd4a-385d-4276-89f3-305db266afa6 stopped leading"}
{"level":"error","ts":"2022-11-14T11:27:21.821Z","logger":"cmd","msg":"Manager exited non-zero","error":"leader election lost","stacktrace":"main.main\n\t/home/runner/work/pulumi-kubernetes-operator/pulumi-kubernetes-operator/cmd/manager/main.go:179\nruntime.main\n\t/opt/hostedtoolcache/go/1.19.2/x64/src/runtime/proc.go:250"}
{"level":"debug","ts":"2022-11-14T11:27:21.821Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"Lease","apiVersion":"coordination.k8s.io/v1"},"reason":"LeaderElection","message":"pulumi-kubernetes-operator-d5cea0e0-69d96cd79b-wtrvm_5a56dd4a-385d-4276-89f3-305db266afa6 stopped leading"}
{"level":"info","ts":"2022-11-14T11:27:21.821Z","logger":"controller-runtime.manager.controller.stack-controller","msg":"Shutdown signal received, waiting for all workers to finish"}

When this happens, the Pulumi backend lock may not be released.

Thanks for the details @YuriiVarvynets. Adding @squaremo for thoughts.

However, if stack process time is higher than 5 minutes and with possible operator restart during this time we will get lock not released, am I right?

Yes, that is correct. I wouldn't consider myself a controller-runtime locking expert, but it appears that failing to renew the lease might cause the graceful shutdown period to be skipped? https://github.com/kubernetes-sigs/controller-runtime/blob/8da9760581ed4f43eee9c2f63764c1cbe0cd3104/pkg/manager/internal.go#L632

I can think of a couple of things we should probably do:

  1. Add more control over the lock manager parameters, e.g. leaseDurationSeconds.
  2. Some sort of control mechanism to run a pulumi cancel as a prerequisite for a stack if necessary? I do think this could be risky and we would need to consider some safety valves here.

pulumi cancel is risky and should be optional. However, it makes sense to me, and probably to others, as I was already thinking of implementing an automatic tool that would clean up locked stacks anyway.
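
For what it's worth, a minimal sketch of such a cleanup using the Python Automation API against the azblob backend from this issue (the project name, credentials, and no-op program are placeholders; it should only ever be run for a stack whose operator pod is known to be gone):

import pulumi.automation as auto


def cancel_stale_lock(stack_name: str) -> None:
    # Select the stack against the self-managed azblob backend and run the
    # equivalent of `pulumi cancel` to delete the stale lock file.
    stack = auto.select_stack(
        stack_name=stack_name,
        project_name="pulumi-operator",  # placeholder project name
        program=lambda: None,            # no-op program; we only need stack operations
        opts=auto.LocalWorkspaceOptions(
            project_settings=auto.ProjectSettings(
                name="pulumi-operator",
                runtime="python",
                backend=auto.ProjectBackend(url="azblob://state"),
            ),
            env_vars={
                "AZURE_STORAGE_ACCOUNT": "<storage-account>",  # placeholder credentials
                "AZURE_STORAGE_KEY": "<storage-key>",
            },
        ),
    )
    stack.cancel()


cancel_stale_lock("pulumi-operator.dev.global")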

I see these factors:

  1. if the controller is shutting down, but a stack is still in progress after the graceful shutdown period has passed, the (Pulumi) backend keeps it locked; which means when it's revisited, it can't make progress
  2. failing to renew the lease will cause the operator to exit with no graceful shutdown period

The first item is a problem by itself, and the second makes it worse.

But I am missing why the lease fails to be renewed. @YuriiVarvynets, can you explain why the "default of 15 seconds is too low for my use case", and whether that's the cause of the failed lease renewal -- e.g., are you running the operator in an environment where making an update every 15 seconds is unrealistic?

To be clear, I think it would be OK to have a longer leaseDurationSeconds. It means less sensitivity to failures (it takes longer for a failure to be detected and for another leader to take over), but since stacks typically deploy in O(minutes), it would not be unreasonable to also take minutes to recover.

An update every 15 seconds is okay from an infrastructure/load point of view. Right now, I see the issue several times a week, which prevents me from using the Pulumi operator in production.
If adding leaseDurationSeconds and making pulumi cancel a configurable option minimizes the likelihood of the issue, that would be great.

Are there any estimates on resolving this issue?

I am using an Azure repo with the Pulumi operator, but after deploying the CRD I am getting an error message. Do you have any proper documentation for running the operator on Azure?