NFD 4.10 / broken ServiceMonitor
Closed this issue · 9 comments
Hi,
I'm now testing the NFD operator (current CSV: nfd.4.10.0-202205120735, although I also see it replaces some nfd.4.10.0-202204251358; the cluster was installed 9 days ago).
In preparation for upgrading several OCP clusters from 4.8 to 4.10, I'm going through Prometheus alerts.
I see one about Prometheus being unable to scrape metrics from a "nfd-controller-manager-metrics-service" Service.
prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
{"status":"success","data":{"alerts":[{"labels":{"alertname":"TargetDown","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","service":"nfd-controller-manager-metrics-service","severity":"warning"},"annotations":{"description":"100% of the nfd-controller-manager-metrics-service/nfd-controller-manager-metrics-service targets in openshift-operators namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.","summary":"Some targets were not reachable from the monitoring server for an extended period of time."},"state":"firing","activeAt":"2022-05-17T14:33:11Z","value":"1e+02"}, ...
In the openshift-operators namespace, I found a "nfd-controller-manager-metrics-monitor" ServiceMonitor, defined with the following:
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    path: /metrics
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: nfd-controller-manager-metrics-service.openshift-nfd.svc
  selector:
    matchLabels:
      control-plane: controller-manager
The thing is: my NFD operator is installed in the "openshift-operators" namespace.
The tlsConfig.serverName does not match: it reads nfd-controller-manager-metrics-service.openshift-nfd.svc, but should be nfd-controller-manager-metrics-service.openshift-operators.svc.
Shouldn't this configuration reflect whichever namespace the operator is installed into?
Later on, as the alert wouldn't go away, I noticed there are two ServiceMonitors with the exact same configuration.
The previous remark also applies to "controller-manager-metrics-monitor".
There's no reason to have two configurations, though. It doesn't look related to the CSV upgrade either, as the creation timestamps almost match:
oc get servicemonitor -n openshift-operators -o yaml controller-manager-metrics-monitor nfd-controller-manager-metrics-monitor | grep creationTimest
creationTimestamp: "2022-05-17T14:30:49Z"
creationTimestamp: "2022-05-17T14:30:50Z"
After fixing spec.endpoints[0].tlsConfig.serverName on both ServiceMonitors, I can confirm Prometheus no longer complains about those metrics:
prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
[nfd servicemonitor alert is gone]
prometheus-pod$ curl "http://127.0.0.1:9090/api/v1/query?query=nfd_degraded_info"
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"nfd_degraded_info","container":"kube-rbac-proxy","endpoint":"https","instance":"10.94.13.23:8443","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","pod":"nfd-controller-manager-747c569f45-gdvdn","service":"nfd-controller-manager-metrics-service"},"value":[1653645665.843,"0"]}]}}
Thanks.
Hey?
So? Is there anyone here, maintaining that piece of code?
It's been almost two months now that monitoring has been complaining about this.
Any plan to fix?
Still affects nfd.4.10.0-202206291026
still affects nfd.4.10.0-202208241855
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
This is still an issue
We have two ServiceMonitors, and each operator upgrade breaks the targeted service name.
And a side note: the label selector in your ServiceMonitor can match other controllers. As I recently saw, the gitops operator controller also carries a "control-plane=controller-manager" label. Not a big deal in this case, given gitops does not listen on such a port, but it's confusing at the very least, and could eventually break.
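To illustrate the collision risk, here's one (hypothetical) way to check what else that label selects cluster-wide:

```shell
# Show everything sharing the overly-generic label. The command is echoed
# here so the sketch runs anywhere; drop the leading "echo" on a real cluster.
SELECTOR='control-plane=controller-manager'
echo oc get services,servicemonitors --all-namespaces -l "$SELECTOR"
```

Any extra hits (like the gitops controller above) are candidates for accidental scraping by this ServiceMonitor.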
It is sad to see how OpenShift went from upstream-first to this...
We used to be able to get bugs fixed by reporting them on GitHub.
Here instead, we get the full shouting-into-the-void experience.
Should we understand that, despite most of its sources being published, OpenShift is slowly moving away from open source?
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Now running 4.10.0-202212061900 and 4.11.0-202212070335.
Obviously, I'm still waiting for an answer.
ServiceMonitor is still broken.
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.