openshift/cluster-nfd-operator

NFD 4.10 / broken ServiceMonitor


Hi,

I'm currently testing the NFD operator (current CSV: nfd.4.10.0-202205120735, although I also see it replaces some nfd.4.10.0-202204251358; the cluster was installed 9 days ago).

In preparation for upgrading several OCP clusters from 4.8 to 4.10, I'm going through Prometheus alerts.
I see one about Prometheus being unable to scrape metrics from a "nfd-controller-manager-metrics-service" Service.

prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
{"status":"success","data":{"alerts":[{"labels":{"alertname":"TargetDown","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","service":"nfd-controller-manager-metrics-service","severity":"warning"},"annotations":{"description":"100% of the nfd-controller-manager-metrics-service/nfd-controller-manager-metrics-service targets in openshift-operators namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.","summary":"Some targets were not reachable from the monitoring server for an extended period of time."},"state":"firing","activeAt":"2022-05-17T14:33:11Z","value":"1e+02"}, ...

In the openshift-operators namespace, I found a "nfd-controller-manager-metrics-monitor" ServiceMonitor, defined with the following spec:

spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    path: /metrics
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: nfd-controller-manager-metrics-service.openshift-nfd.svc
  selector:
    matchLabels:
      control-plane: controller-manager

The thing is: my NFD operator is installed in the "openshift-operators" namespace.
The tlsConfig.serverName does not match: it reads nfd-controller-manager-metrics-service.openshift-nfd.svc, but should be nfd-controller-manager-metrics-service.openshift-operators.svc.
Shouldn't this configuration reflect whichever namespace the operator is installed into?

Later on, as the alert wouldn't go away, I noticed there are two ServiceMonitors with the exact same configuration.
The previous remark also applies to that second one, "controller-manager-metrics-monitor".
There's no reason to have two configurations, though, and it doesn't look related to the CSV upgrade either, as the creation timestamps almost match:

oc get servicemonitor -n openshift-operators -o yaml controller-manager-metrics-monitor nfd-controller-manager-metrics-monitor | grep creationTimest
    creationTimestamp: "2022-05-17T14:30:49Z"
    creationTimestamp: "2022-05-17T14:30:50Z"

After fixing the spec.tlsConfig.serverName on both ServiceMonitors, I can confirm Prometheus no longer complains about those metrics.
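
For reference, a patch along these lines would do the trick (just a sketch, assuming the https endpoint is the first entry in spec.endpoints; repeat it for the second ServiceMonitor, and keep in mind an operator upgrade re-creates the broken value):

oc -n openshift-operators patch servicemonitor nfd-controller-manager-metrics-monitor \
    --type=json \
    -p '[{"op": "replace", "path": "/spec/endpoints/0/tlsConfig/serverName", "value": "nfd-controller-manager-metrics-service.openshift-operators.svc"}]'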

prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
[nfd servicemonitor alert is gone]

prometheus-pod$ curl "http://127.0.0.1:9090/api/v1/query?query=nfd_degraded_info"
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"nfd_degraded_info","container":"kube-rbac-proxy","endpoint":"https","instance":"10.94.13.23:8443","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","pod":"nfd-controller-manager-747c569f45-gdvdn","service":"nfd-controller-manager-metrics-service"},"value":[1653645665.843,"0"]}]}}

Thanks.

Hey?
So? Is there anyone here maintaining that piece of code?
It's been almost two months now that monitoring has been complaining about this.
Any plan to fix it?

Still affects nfd.4.10.0-202206291026

Still affects nfd.4.10.0-202208241855

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

This is still an issue

We still have two ServiceMonitors, and each operator upgrade breaks the targeted server name again.

And a side note: the label selector in your Service can match other controllers. As I recently saw, the GitOps operator controller also carries a "control-plane=controller-manager" label. It's not a big deal in this case, since GitOps does not listen on such a port, but it's confusing at best, and could eventually break something.
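
A quick way to see what else that selector picks up, for whoever wants to check (a sketch):

oc get pods --all-namespaces -l control-plane=controller-manager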

It is sad to see how OpenShift went from upstream-first to this...
It used to be that we could get bugs fixed by reporting them on GitHub.
Here, we get the full shouting-into-the-void experience instead.

Should we understand that, despite most of its sources being published, OpenShift is slowly moving away from open source?

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Now running 4.10.0-202212061900 and 4.11.0-202212070335.

Obviously, I'm still waiting for an answer.
The ServiceMonitor is still broken.

/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.