operator-framework/operator-sdk

How to troubleshoot a failure during run bundle-upgrade

etsauer opened this issue · 1 comments

Type of question

Best practices
How to implement a specific feature

Question

I'm trying to test the upgrade path for my helm-based operator using the sdk bundle. It's failing because of a missing install plan, but I'm not sure exactly how to get more information about what I'm doing wrong.

What did you do?

# Install current release of the operator
operator-sdk run bundle quay.io/pelorus/pelorus-operator-bundle:v0.0.9 --namespace test-pelorus-operator
# Once installed successfully, attempt the upgrade
operator-sdk run bundle-upgrade quay.io/pelorus/rc-pelorus-operator-bundle:vpr1157-34d9eef --namespace test-pelorus-operator --verbose

What did you expect to see?

I hoped to see the upgrade succeed.

What did you see instead? Under which circumstances?

The install failed after the deleting of the old registry pod:

INFO[0018] Generated a valid Upgraded File-Based Catalog 
INFO[0020] Created registry pod: quay-io-pelorus-rc-pelorus-operator-bundle-vpr1157-34d9eef 
INFO[0020] Updated catalog source pelorus-operator-catalog with address and annotations 
INFO[0021] Deleted previous registry pod with name "quay-io-pelorus-pelorus-operator-bundle-v0-0-9" 
FATA[0120] Failed to run bundle upgrade: install plan is not available for the subscription pelorus-operator-v0-0-9-sub: context deadline exceeded 

I also see the following subscriptions and installplans:

$ oc get subscription -n test-pelorus-operator
NAME                                                            PACKAGE            SOURCE                     CHANNEL
grafana-operator-v4-community-operators-openshift-marketplace   grafana-operator   community-operators        v4
pelorus-operator-v0-0-9-sub                                     pelorus-operator   pelorus-operator-catalog   operator-sdk-run-bundle
prometheus-beta-community-operators-openshift-marketplace       prometheus         community-operators        beta

$ oc get installplan -n test-pelorus-operator
NAME            CSV                       APPROVAL   APPROVED
install-t2fjd   grafana-operator.v4.8.0   Manual     true

NOTE: the prometheus and grafana operators are dependencies of this operator, which is why you see them in this namespace.

Environment

Operator type:

/language helm

Kubernetes cluster type:

OpenShift 4.15

$ operator-sdk version

operator-sdk version: "v1.33.0", commit: "542966812906456a8d67cf7284fc6410b104e118", kubernetes version: "1.27.0", go version: "go1.21.5", GOOS: "linux", GOARCH: "amd64"

$ kubectl version

Additional context

This happens with or without the actual operand resource created, so it seems to be some pretty basic issue, maybe with how the bundle is configured.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale