openshift/enhancements

[RFE] Operators Health Metric

sradco opened this issue · 16 comments

As part of an effort to standardize operator observability, so that we can create useful tooling for operator developers, certify operator capability levels against well-defined requirements, and test them regularly in CI, I would like to propose a generic way to report an operator health metric. It would be based on alerts and would give an in-depth way to inspect operator health.

At the moment OLM only reports whether the operator CSV is in the Succeeded state, via the csv_succeeded metric.
This value doesn't "know" whether the operator is actually healthy or not.
Operators create their own health metrics, which cannot be used in generic dashboards and differ in how they are implemented and calculated, which can lead to inconsistencies.

Motivation

  • Create a standard for reporting operator health that can be validated using CI and visualized.
  • Since Kubernetes is built in a way that resources are created, destroyed and reconciled all the time, I would like the health metric to indicate whether there is a real issue with the operator, based on how long the issue has existed.
    For example: if all of an operator's API servers are down, we should not immediately report the operator as unhealthy, since this can be addressed automatically by Kubernetes (it can be caused, for example, by an operator upgrade).
    Only if they are down for longer than X should we indicate that there is an actual issue.
  • Drive operator developers to invest in adding metrics and alerts to their operator, which will increase operator observability as a whole.
  • Avoid code duplication and provide a clear understanding of what impacts operator health.

Design

There should be 2 health metrics:

  1. Stable health metric - Based on firing alerts and their severity.
  2. Real-time health metric - Based on firing + pending alerts, which can indicate the operator's stability.

If an operator has at least one critical alert firing that relates to the operator's functionality, meaning that some important functionality is lost, then its health becomes "Unhealthy" (Red).
If the operator has more than X warning alerts, we should consider marking the operator as "At risk" (Yellow).

For the real-time metric we can implement a similar calculation, but also include pending alerts, even if Kubernetes would fix the issue automatically.
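
To make the proposed calculation concrete, here is a minimal sketch in Go. Everything in it is illustrative and not part of any existing API: the Alert struct, the color names, and the warningThreshold constant standing in for "X" are assumptions, and the health flag corresponds to the health label proposed in the Requirements section below.

```go
// Minimal sketch of the proposed health calculation (illustrative only).
package health

// Alert is a simplified view of a firing or pending alert, as it would be
// read from the Prometheus ALERTS metric. The HealthImpact field stands for
// the health label proposed below.
type Alert struct {
	Severity     string // "critical", "warning", ...
	State        string // "firing" or "pending"
	HealthImpact bool   // does this alert impact the operator's health?
}

// Health is the coarse-grained state reported for an operator.
type Health string

const (
	Healthy   Health = "Green"
	AtRisk    Health = "Yellow"
	Unhealthy Health = "Red"
)

// warningThreshold is the hypothetical "X" from the proposal.
const warningThreshold = 3

// StableHealth only considers firing alerts that declare an impact on the
// operator's functionality.
func StableHealth(alerts []Alert) Health { return evaluate(alerts, false) }

// RealTimeHealth also counts pending alerts, so it reacts before the alert's
// evaluation period ("for" duration) has elapsed.
func RealTimeHealth(alerts []Alert) Health { return evaluate(alerts, true) }

func evaluate(alerts []Alert, includePending bool) Health {
	warnings := 0
	for _, a := range alerts {
		if !a.HealthImpact {
			continue
		}
		if a.State == "pending" && !includePending {
			continue
		}
		switch a.Severity {
		case "critical":
			// One critical, health-impacting alert is enough.
			return Unhealthy
		case "warning":
			warnings++
		}
	}
	if warnings >= warningThreshold {
		return AtRisk
	}
	return Healthy
}
```

Whether a pending critical alert should immediately turn the real-time state Red, or only Yellow, is one of the details the proposal leaves open.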

Why should we use alerts to calculate operator health

Operators determine when functionality is lost or compromised and alert the user once the evaluation time that was set for the alert has passed and the alert starts firing.
This evaluation period is important for determining whether there is indeed an issue that Kubernetes was unable to resolve.
It also makes it easy to examine why the operator is unhealthy.

Requirements

  1. Which operator sent this alert? - Add a generic label to each alert that identifies the operator that sent it.
    Proposed name: kubernetes_operator_part_of. The label name is based on the Kubernetes Recommended Labels.
  2. Does the alert impact operator functionality? - Add a label that identifies alerts that affect the operator's health.
    Proposed label name:
  • health_alert: true/false - Operators will need to add this label to all alerts and decide whether they impact the operator's health.
    This is admittedly a somewhat subjective label, but we should require it; in most cases the developers adding the alerts will know whether an alert impacts the operator's health or not. A sketch of a rule carrying both proposed labels is shown after this list.
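
For illustration, here is a sketch of what a rule carrying both proposed labels could look like, written as a plain Go value rather than with the real prometheus-operator API types. The AlertRule struct, the alert name, the metric in the expression, and the durations are all hypothetical; only the two label names (kubernetes_operator_part_of and health_alert) come from this proposal.

```go
// Illustrative only: a plain struct mirroring the fields of a Prometheus
// alerting rule that matter for this proposal.
package example

type AlertRule struct {
	Alert       string
	Expr        string
	For         string
	Labels      map[string]string
	Annotations map[string]string
}

// Hypothetical rule for a fictional "my-operator". The two proposed labels
// let generic tooling attribute the alert to an operator and decide whether
// it affects that operator's health.
var myOperatorReconcileStalled = AlertRule{
	Alert: "MyOperatorReconcileStalled",
	// Hypothetical metric; real operators would use their own signal here.
	Expr: `time() - my_operator_last_successful_reconcile_timestamp_seconds > 600`,
	// The "for" duration gives Kubernetes time to recover on its own before
	// the alert starts firing and affects the health metric.
	For: "10m",
	Labels: map[string]string{
		"severity":                    "critical",
		"kubernetes_operator_part_of": "my-operator",
		"health_alert":                "true",
	},
	Annotations: map[string]string{
		"summary": "my-operator has not reconciled successfully for more than 10 minutes",
	},
}
```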

I very much like the idea of having a standard way of reporting operator health that has a low barrier of entry.

I don't think though that alerts are the right trigger for this.

  1. I agree that operators already vary in how they implement health metrics. I presume asking developers to add alert labels signifying impact on operator health will be just as varied.
  2. Existing alerts often refer to the operands, rarely the operator itself. Unhealthy operands (i.e. firing alerts) don't necessarily reflect on the operator health.
  3. OLM would have to query an endpoint providing the ALERTS metric. The question is where an operator's metrics will be kept. The OpenShift in-cluster stack is probably what comes to mind, but that doesn't have to be universally true.
  4. There is currently work being done to add an agent mode for Prometheus (Prometheus already has it; prometheus-operator support is a WIP). A Prometheus in agent mode has no local querying capabilities, i.e. no local alerting, in which case this approach will no longer provide insight.

As I said, I do like this effort! Maybe as a first step it's worth having a discussion about what operator health actually means, as opposed to the health of the operands or a particular CR instance? A CR might not be healthy (and it might expose that through its status subresource) but the operator that reconciles changes to that CR might be running just fine. To me it's worth mapping out the difference.

I very much like the idea of having a standard way of reporting operator health that has a low barrier of entry.

I don't think though that alerts are the right trigger for this.

  1. I agree that operators already vary in how they implement health metrics. I presume asking developers to add alert labels signifying impact on operator health will be just as varied.

@jan--f Thank you for the review.

I think that the developer who creates the alert knows whether the alert impacts the health of the whole operator. It's even something that they must explain in the "Impact" section of the alert's runbook.

I think we can consider how to implement this and what the label should be called, but it's a very important label for the alert itself, regardless of the health metric.

Critical alerts definition from the OCP docs:

For alerting current and impending disaster situations. These alerts page an SRE. The situation should warrant waking someone in the middle of the night.

Reserve critical level alerts only for reporting conditions that may lead to
loss of data or inability to deliver service for the cluster as a whole.
Failures of most individual components should not trigger critical level alerts,
unless they would result in either of those conditions. Configure critical level
alerts so they fire before the situation becomes irrecoverable. Expect users to
be notified of a critical alert within a short period of time after it fires so
they can respond with corrective action quickly.

Having critical alerts for an operator means by definition that the operator is unhealthy and should be "Red".
There may be warning alerts that also impact the operator health but don't impact the cluster as a whole or don't cause loss of data.

Adding a label that indicates there is a health issue for the operator can help with creating work pipelines in Alertmanager, indicates that these alerts are important to fix, and can turn the operator health to "Yellow".
So I agree that the new label should be well defined and mean that the alert impacts the operator health.

The added value of the label would be the ability to calculate the operator health in a generic way.
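
As an illustration of what that generic calculation could look like, the sketch below queries the Prometheus ALERTS metric through its HTTP API (using the client_golang library) and counts firing health alerts per operator. The Prometheus address and the exact query are assumptions for the sketch; only the two proposed label names come from the proposal.

```go
// Illustrative only: count firing, health-impacting alerts per operator via
// the Prometheus HTTP API.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// Hypothetical in-cluster Prometheus address.
	client, err := api.NewClient(api.Config{
		Address: "http://prometheus-k8s.openshift-monitoring.svc:9090",
	})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// The same query works for every operator that follows the proposed
	// labeling convention, which is what makes the calculation generic.
	query := `count by (kubernetes_operator_part_of, severity) (
	              ALERTS{alertstate="firing", health_alert="true"})`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}

	// A vector with one sample per (operator, severity) pair.
	if vec, ok := result.(model.Vector); ok {
		for _, sample := range vec {
			fmt.Printf("%s severity=%s: %v firing health alerts\n",
				sample.Metric["kubernetes_operator_part_of"],
				sample.Metric["severity"],
				sample.Value)
		}
	}
}
```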

  1. Existing alerts often refer to the operands, rarely the operator itself. Unhealthy operands (i.e. firing alerts) don't necessarily reflect on the operator health.

I think you are correct. The label should not be about whether the alert concerns the operand or the operator, but should mean that it impacts the operator's health, i.e. its ability to perform the basic functions that it should.

  1. OLM would have to query an endpoint providing the ALERTS metric. The question is where an operator's metrics will be kept. The OpenShift in-cluster stack is probably what comes to mind, but that doesn't have to be universally true.

That is true. But the health label will have value even without the health metric.
All the metrics that we provide are based on the Prometheus operator. We do have the basic metric from OLM.
I look at the operator health metric like the other metrics that we provide. Is that not the correct way?
It's the only way for us to get data that is not instant, but also looks at the system over time.
Since the system is built to fix itself, having an alerts-based metric would allow us to see issues that the system was unable to fix.

  1. There is currently work being done to add an agent mode for Prometheus (Prometheus already has it; prometheus-operator support is a WIP). A Prometheus in agent mode has no local querying capabilities, i.e. no local alerting, in which case this approach will no longer provide insight.

That is very good to know. Thank you.
It means that the data is being collected to the remote Prometheus, right?
Wouldn't the remote Prometheus still have the alerting capabilities?
Would the OCP UI query the remote instance?

As I said, I do like this effort! Maybe as a first step it's worth having a discussion about what operator health actually means, as opposed to the health of the operands or a particular CR instance? A CR might not be healthy (and it might expose that through its status subresource) but the operator that reconciles changes to that CR might be running just fine. To me it's worth mapping out the difference.

Yes. Will be happy to have a discussion about this.

@jan--f I would appreciate it if you could review my comments above and add others that can review this proposal.

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

/remove-lifecycle stale

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

/remove-lifecycle stale

Created the following PR with more details #1280

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

/remove-lifecycle stale

@sradco, I've seen this PR go stale a few times now. Is this on the roadmap for an upcoming release? If so, who needs to be reviewing it so we can move forward? If not, let's let it close until we're ready to work on it so we can keep the active review list cleared.

Hi @dhellmann, I created a PR, #1280, and asked for reviews based on @jan--f's suggestion.
I would appreciate it if others could look at it.

I believe this can bring value to Red Hat customers on the UI side and also to community operators that work with the Prometheus stack, since the labels that are added can be used for better routing in Alertmanager.

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.