openshift/cluster-nfd-operator

NFD 4.10 / broken ServiceMonitor


Hi,

I'm currently testing the NFD operator (current CSV: nfd.4.10.0-202205120735, although I also see it replaces some nfd.4.10.0-202204251358; the cluster was installed 9 days ago).

In preparation for upgrading several OCP clusters from 4.8 to 4.10, I'm going through Prometheus alerts.
I see one about Prometheus being unable to scrape metrics from a "nfd-controller-manager-metrics-service" Service.

prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
{"status":"success","data":{"alerts":[{"labels":{"alertname":"TargetDown","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","service":"nfd-controller-manager-metrics-service","severity":"warning"},"annotations":{"description":"100% of the nfd-controller-manager-metrics-service/nfd-controller-manager-metrics-service targets in openshift-operators namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.","summary":"Some targets were not reachable from the monitoring server for an extended period of time."},"state":"firing","activeAt":"2022-05-17T14:33:11Z","value":"1e+02"}, ...

In the openshift-operators namespace, I found a "nfd-controller-manager-metrics-monitor" ServiceMonitor, defined with the following spec:

spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    path: /metrics
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: nfd-controller-manager-metrics-service.openshift-nfd.svc
  selector:
    matchLabels:
      control-plane: controller-manager

The thing is: my NFD operator is installed in the "openshift-operators" namespace.
The tlsConfig.serverName does not match: it reads nfd-controller-manager-metrics-service.openshift-nfd.svc, but should be nfd-controller-manager-metrics-service.openshift-operators.svc.
Shouldn't this configuration reflect whichever namespace the operator is installed into?

Later on, as the alert wouldn't go away, I noticed there are two ServiceMonitors with the exact same configuration.
The previous remark also applies to that second one, "controller-manager-metrics-monitor".
There's no reason to have two configurations, though, and it doesn't look related to the CSV upgrade either, as the creation timestamps almost match:

oc get servicemonitor -n openshift-operators -o yaml controller-manager-metrics-monitor nfd-controller-manager-metrics-monitor | grep creationTimest
    creationTimestamp: "2022-05-17T14:30:49Z"
    creationTimestamp: "2022-05-17T14:30:50Z"

After fixing the spec.tlsConfig.serverName on both ServiceMonitors, I can confirm Prometheus no longer complains about those metrics.
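
For reference, a patch along these lines would do the trick (just a sketch, assuming the https endpoint is the first entry in spec.endpoints; repeat it for the second ServiceMonitor, and keep in mind an operator upgrade re-creates the broken value):

oc -n openshift-operators patch servicemonitor nfd-controller-manager-metrics-monitor \
    --type=json \
    -p '[{"op": "replace", "path": "/spec/endpoints/0/tlsConfig/serverName", "value": "nfd-controller-manager-metrics-service.openshift-operators.svc"}]'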

prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
[nfd servicemonitor alert is gone]

prometheus-pod$ curl "http://127.0.0.1:9090/api/v1/query?query=nfd_degraded_info"
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"nfd_degraded_info","container":"kube-rbac-proxy","endpoint":"https","instance":"10.94.13.23:8443","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","pod":"nfd-controller-manager-747c569f45-gdvdn","service":"nfd-controller-manager-metrics-service"},"value":[1653645665.843,"0"]}]}}

Thanks.

Hey?
So? Is there anyone here maintaining that piece of code?
It's been almost two months now that monitoring has been complaining about this.
Any plan to fix it?

Still affects nfd.4.10.0-202206291026

Still affects nfd.4.10.0-202208241855

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

This is still an issue

We still have two ServiceMonitors, and each operator upgrade breaks the targeted server name again.

And a side note: the label selector in your Service can match other controllers. As I recently saw, the GitOps operator controller also carries a "control-plane=controller-manager" label. It's not a big deal in this case, since GitOps does not listen on such a port, but it's confusing at best, and could eventually break something.
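
A quick way to see what else that selector picks up, for whoever wants to check (a sketch):

oc get pods --all-namespaces -l control-plane=controller-manager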

It is sad to see how OpenShift went from upstream-first to this...
It used to be that we could get bugs fixed by reporting them on GitHub.
Here, we get the full shouting-into-the-void experience instead.

Should we understand that, despite most of its sources being published, OpenShift is slowly moving away from open source?

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Now running 4.10.0-202212061900 and 4.11.0-202212070335.

Obviously, I'm still waiting for an answer.
The ServiceMonitor is still broken.

/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.