openshift/enhancements

CI Operator Visualizer Tool

michaelgugino opened this issue · 7 comments

A tool to enable engineers to more quickly diagnose the causes of failed CI runs. This tool should run automatically after or during artifact gathering and display an easy-to-consume visualization with the important details for each operator.

Long term, it would be nice to present operator statuses across CI runs in an OK/NotOK fashion, so that a particular team, such as machine-api, can look across all CI runs and see whether any problem was detected with their particular operator.
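As a rough illustration of the per-operator OK/NotOK view, below is a minimal Go sketch, not the proposed implementation, that reads a gathered ClusterOperator list and prints a one-line verdict per operator. The artifact path and file name are assumptions; CI jobs may store the dump under a different name or layout.

```go
// Minimal sketch: read a gathered ClusterOperator list (JSON) and print an
// OK/NotOK verdict per operator based on its Available/Degraded conditions.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Message string `json:"message"`
}

type clusterOperator struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// operatorList matches a "kind: List" dump of ClusterOperator objects,
// which is one plausible shape for the gathered artifact (an assumption).
type operatorList struct {
	Items []clusterOperator `json:"items"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: visualize <clusteroperators.json>")
		os.Exit(1)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var list operatorList
	if err := json.Unmarshal(data, &list); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, co := range list.Items {
		verdict, reason := "OK", ""
		for _, c := range co.Status.Conditions {
			// Available should be True and Degraded should be False;
			// anything else gets flagged for a closer look.
			if (c.Type == "Available" && c.Status != "True") ||
				(c.Type == "Degraded" && c.Status == "True") {
				verdict, reason = "NotOK", c.Message
			}
		}
		fmt.Printf("%-45s %-6s %s\n", co.Metadata.Name, verdict, reason)
	}
}
```

Running it against a dump, e.g. `go run visualize.go clusteroperators.json` (hypothetical file name), would print one row per operator with the failing condition's message where applicable, which is roughly the table the tool would render.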

Some discussion points from recent arch call:

1q) Operators should do more to detect conditions and present the information
1a.1) Operators do present some of that information today, but how accessible is it to the end user? For engineers looking at a CI run, that information is embedded in a large JSON file, and nobody really looks at it.
1a.2) For the machine-api specifically, our operator status is not tied to the status of individual machines, since those are user-tunable. Often there's no reason to look at machine objects because they are not a primary indication (e.g., etcd being down to 2/3 replicas would be the alarm), and it takes some effort to work through the layers of the stack to see that a machine is having a problem and might be the root cause (or might point you to a different root cause). A sketch of surfacing machine status this way follows the discussion points below.

2q) This information should integrate with insights/must-gather, or run in the cluster
2a) I don't disagree; that would be nice follow-on work. We can prove it out in CI first, where it will be immediately most useful (we're diagnosing failed clusters).

3q) Operators should be able to self-correct known conditions
3a) If etcd wants to deploy 3 replicas but can't because we don't have enough control plane hosts, it's powerless to correct that. Likewise, if the machine-api can't create a new control plane host because the machineset is misconfigured or the AZ went dark, there's nothing we can do to correct that.
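To make 1a.2 concrete, here is a similar minimal sketch that scans a gathered Machine list and flags any machine not in the Running phase, so the visualizer could point at a struggling machine without someone digging through the layers of the stack by hand. The status fields follow the machine-api Machine object, but the artifact file name and location are again assumptions.

```go
// Minimal sketch: flag Machine objects that are not in the Running phase,
// surfacing machine-api problems that would otherwise be buried beneath
// other operators' symptoms.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type machine struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Status struct {
		Phase        string `json:"phase"`
		ErrorMessage string `json:"errorMessage"`
	} `json:"status"`
}

type machineList struct {
	Items []machine `json:"items"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: machines <machines.json>")
		os.Exit(1)
	}
	// e.g. a gathered machines.json from the artifacts dir (hypothetical name).
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var list machineList
	if err := json.Unmarshal(data, &list); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range list.Items {
		if m.Status.Phase != "Running" {
			fmt.Printf("NotOK  %s  phase=%s  %s\n",
				m.Metadata.Name, m.Status.Phase, m.Status.ErrorMessage)
		}
	}
}
```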

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
