prometheus-operator/kube-prometheus

process_start_time_seconds Metric Collected Twice Due to ServiceMonitor Changes in Commit 43f2094

Closed this issue · 1 comment

With the latest update (commit 74e445a), the ServiceMonitor for the kube-apiserver has changed. As a result, the process_start_time_seconds metric is now collected twice: once from the /metrics path and once from the /metrics/slis path. It is also somewhat confusing that the newly added scrape configuration for the /metrics/slis path runs every 5 seconds, while the /metrics path is scraped every 30 seconds.

  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 5s
    path: /metrics/slis
    port: https
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes
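
For comparison, the pre-existing /metrics endpoint entry in the same ServiceMonitor presumably looks roughly like this (a sketch: only the 30-second interval and the path come from the description above, the remaining fields are assumed to mirror the snippet shown):

  # Sketch: interval taken from the issue description; other fields assumed
  # to mirror the /metrics/slis entry above.
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    path: /metrics
    port: https
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes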

Debug log from our Prometheus server

ts=2024-09-03T08:51:39.348Z caller=scrape.go:1760 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/kube-apiserver/0 target=https://10.34.28.164:443/metrics msg="Out of order sample" series=process_start_time_seconds

Metrics exposed on the related paths

rgarcia$ curl -s -k -H "Authorization: Bearer $(cat tokenfile-scn)" https://10.34.28.164:443/metrics/slis | grep process_start_time_seconds
# HELP process_start_time_seconds [ALPHA] Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72534614475e+09

rgarcia$ curl -s -k -H "Authorization: Bearer $(cat tokenfile-scn)" https://10.34.28.164:443/metrics | grep process_start_time_seconds
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72534615776e+09

It seems the changes were introduced by commit 43f2094 from @dgrisonnet.

I would suggest dropping process_start_time_seconds from the /metrics/slis path to avoid further issues.

Thank you for reporting, I didn't run into this issue whilst testing locally so I am glad you reported it.

The reasoning behind the different scrape interval is that the /metrics/slis endpoint exposes far fewer metrics, but ones that are important to scrape at a high frequency. You can read more about this fairly new endpoint at https://kubernetes.io/docs/reference/instrumentation/slis/.

It appears that process_start_time_seconds was intentionally added to address kubernetes/kubernetes#122520, but looking at it again, it seems we might've been wrong.

I'll update the ServiceMonitor to drop the metric and follow up in Kubernetes on whether we should fully drop this metric or not.
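
For reference, one way to express such a drop is a metricRelabelings rule on the /metrics/slis endpoint; the following is a minimal sketch, not necessarily the exact change that will land:

  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 5s
    path: /metrics/slis
    port: https
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes
    # Sketch: drop the duplicated metric from the SLI endpoint so that only
    # /metrics keeps exposing process_start_time_seconds.
    metricRelabelings:
    - sourceLabels: [__name__]
      regex: process_start_time_seconds
      action: drop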