openshift/cloud-credential-operator

error handling is not clear

mjudeikis opened this issue · 11 comments

An Azure cluster with a broken cloud-credential-operator should be more verbose and report its state more clearly.

time="2019-06-18T05:47:06Z" level=debug msg="set ClusterOperator condition" message="No credentials requests reporting errors." reason=NoCredentialsFailing status=False type=Degraded
time="2019-06-18T05:47:06Z" level=debug msg="set ClusterOperator condition" message="4 of 7 credentials requests provisioned, 0 reporting errors." reason=Reconciling status=True type=Progressing
time="2019-06-18T05:47:06Z" level=debug msg="set ClusterOperator condition" message= reason= status=True type=Available
time="2019-06-18T05:47:32Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-azure-temp
time="2019-06-18T05:47:32Z" level=debug msg="found secret namespace" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-azure-temp secret=openshift-machine-api/azure-cloud-credentials-test
time="2019-06-18T05:47:32Z" level=error msg="error checking whether credentials already exists: Secret \"azure-credentials\" not found" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-azure-temp secret=openshift-machine-api/azure-cloud-credentials-test

This indicates that the azure-credentials secret does not exist in the kube-system namespace.
In this state I would call the operator degraded, but the status reports otherwise:

NAME                                 VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                       Unknown     Unknown       True       19h
cloud-credential                     4.2.0-0.okd-2019-06-17-063354   True        True          False      19h

The cluster can't fully function without it.
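A quick way to confirm what the error above points at (the secret and CredentialsRequest names are taken from the log lines; the commands themselves are just an illustrative sketch):

# Does the root Azure credential the CCO is complaining about exist?
oc get secret azure-credentials -n kube-system

# Inspect the CredentialsRequest that fails to sync
oc get credentialsrequest openshift-machine-api-azure-temp -n openshift-cloud-credential-operator -o yaml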

The CCO was built to hopefully get us to a point where the root credentials are not stored in the cluster itself, particularly for dedicated Hive + SRE fully managed clusters. An early step towards this was to allow the root credentials to be deleted after install and re-instated before an upgrade if needed. Provided all the credentials are provisioned, we don't consider this an error state. It might be worth reporting Degraded though; I will talk to the team. (cc @joelddiaz )
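For reference, re-instating the deleted root credential before an upgrade is roughly the following; the secret key names are an assumption based on what the installer generates, so copy them from a real installer-created secret rather than from this sketch:

# Sketch only: recreate the root Azure credential in kube-system before upgrading.
# Key names here are assumptions; mirror the installer-generated secret exactly.
oc create secret generic azure-credentials -n kube-system \
  --from-literal=azure_subscription_id=<subscription-id> \
  --from-literal=azure_tenant_id=<tenant-id> \
  --from-literal=azure_client_id=<client-id> \
  --from-literal=azure_client_secret=<client-secret>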

The "4 of 7 credentials requests provisioned, 0 reporting errors" line looks suspect; is this the AWS creds not being provisioned? We are tracking a change to get the CCO to better calculate status with multi-cloud creds in play here: https://jira.coreos.com/browse/CO-443
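To see which CredentialsRequests are counted as provisioned, something like the following should work; treating .status.provisioned as the relevant field is an assumption on my part, not a documented contract:

# Rough sketch: list CredentialsRequests and whether the CCO has provisioned them.
# The .status.provisioned path is an assumption; verify against `-o yaml` output.
oc get credentialsrequests -n openshift-cloud-credential-operator \
  -o custom-columns=NAME:.metadata.name,PROVISIONED:.status.provisioned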

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

/remove-lifecycle rotten

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.
