prometheus-community/helm-charts

[kube-prometheus-stack] Lost Prometheus targets after upgrade to version 63

kallaics opened this issue · 23 comments

Describe the bug (a clear and concise description of what the bug is)

We have PodMonitor, ServiceMonitor, and ScrapeConfig resources, and after the upgrade these resources completely disappeared from the targets list in Prometheus and no longer collect metrics. I spent a morning looking for a relevant change in the release notes but found nothing. The only solution was to downgrade.
All affected resources are included through selector matchLabels.

Worked well on version: 62.7.0
Did not work on versions: 63.0.0 and 63.0.1

Prometheus configuration

prometheus:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      nginx.ingress.kubernetes.io/whitelist-source-range: <ip ranges>
    labels: {}
    hosts:
      - prometheus.<domain>.<tld>
    paths:
      - /
    pathType: ImplementationSpecific
    tls: []
  prometheusSpec:
    prometheusExternalLabelNameClear: true
    replicaExternalLabelNameClear: true
    podMonitorSelector:
      matchLabels:
        app.kubernetes.io/part-of: flux
    serviceMonitorSelector:
      matchLabels:
        monitoring: prometheus
    scrapeConfigSelector:
      matchLabels:
        monitoring: prometheus
    secrets:
      - ca-certs
    additionalArgs:
      - name: "web.enable-admin-api"
    resources:  
      requests:
        memory: 6Gi

Scrape configuration (I know it is in alpha status, so we are testing it on a non-production environment only)

apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: <Kubernetes cluster name>
  namespace: monitoring
  labels:
    monitoring: prometheus-scrape
spec:
  scrapeInterval: 15s
  scrapeTimeout: 12s
  honorLabels: true
  metricsPath: '/federate'
  scheme: HTTPS
  tlsConfig:
    ca:
      secret:
        key: internal-ca.pem
        name: ca-certs
  basicAuth:
    username:
      key: username
      name: monitoring-federate-auth
    password:
      key: password
      name: monitoring-federate-auth
  params:
    'match[]':
      - '{__name__=~".+"}'
  staticConfigs:
    - labels:
        job: <Kubernetes cluster name>
        env: dev
        cluster_name: <Kubernetes cluster name>
      targets:
        - 'prometheus.<domain2>.<tld>'

Thanks guys for the help.

What's your helm version?

version.BuildInfo{Version:"v3.16.1", GitCommit:"5a5449dc42be07001fd5771d56429132984ab3ab", GitTreeState:"dirty", GoVersion:"go1.23.1"}

What's your kubectl version?

Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.29.7

Which chart?

kube-prometheus-stack

What's the chart version?

63.0.0, 63.1.0

What happened?

After the Helm chart upgrade, most of the targets disappeared from Prometheus.
Affected resources:

  • PodMonitor
  • ServiceMonitor
  • ScrapeConfig

What you expected to happen?

The targets should not disappear from Prometheus.

How to reproduce it?

Install Prometheus from kube-prometheus-stack version 62.7.0.
Install a PodMonitor resource in a different namespace from Prometheus.
Install ServiceMonitor and ScrapeConfig resources in the same namespace as Prometheus.
After all targets work well, upgrade the chart to 63.0.0 or 63.1.0.

Enter the changed values of values.yaml?

None

Enter the command that you executed that is failing/misfunctioning.

helm install prometheus prometheus-community/kube-prometheus-stack --values values.yaml

Anything else we need to know?

The deployment happened via FluxCD v2.3.0 (latest stable).
A full Helm chart reinstall from scratch did not solve the issue.
All CRDs are on the latest version for the Helm chart.

Same here, I had to revert!

@kallaics did you check the matching labels? The default values now set release: release_name as the default label, and I had to add this label to all service monitors.

You are right, I forgot to add this step to the reproduction steps. The custom matching label is a difference from the default settings.

I double-checked the matching labels. My custom matching labels were added to the resources.

The Helm chart configuration did not change during the upgrade on my side, and after the downgrade everything was okay again.
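For reference, a minimal sketch of a ServiceMonitor that carries the default matching label discussed above (the resource name my-app and the release name prometheus are hypothetical):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                 # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus        # must match the chart's default selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: http-metrics
```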

Instead of:

    scrapeConfigSelector:
      matchLabels:
        monitoring: prometheus

use

    scrapeConfigSelector:
      matchLabels:
        release: null
        monitoring: prometheus

Ref:

## To remove the release label from the matchLabels condition, explicit set release to null.

I would like to remove the default matchLabels.

The following works as expected:

helm template foo kube-prometheus-stack \
    --repo https://prometheus-community.github.io/helm-charts \
    >z0

cat >values.yaml <<EOF
prometheus:
  prometheusSpec:
    ruleSelector:
      matchLabels: null
    serviceMonitorSelector:
      matchLabels: null
EOF

helm template foo kube-prometheus-stack \
    --repo https://prometheus-community.github.io/helm-charts \
    --values values.yaml \
    >z1
$ diff z0 z1
--- z0
+++ z1
@@ -2273,8 +2273,7 @@
   routePrefix: "/"
   serviceAccountName: foo-kube-prometheus-stack-prometheus
   serviceMonitorSelector: 
-    matchLabels:
-      release: 'foo'
+    {}
   serviceMonitorNamespaceSelector: 
     {}
   podMonitorSelector: 
@@ -2297,8 +2296,7 @@
   ruleNamespaceSelector: 
     {}
   ruleSelector: 
-    matchLabels:
-      release: 'foo'
+    {}
   scrapeConfigSelector: 
     matchLabels:
       release: 'foo'

But if I use it as a subchart:

cat >Chart.yaml <<EOF
---
apiVersion: v2
name: system
type: application
version: 0.0.0
appVersion: 0.0.0
dependencies:
  - name: kube-prometheus-stack
    version: 63.1.0
    repository: https://prometheus-community.github.io/helm-charts
EOF

cat >values.yaml <<EOF
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      ruleSelector:
        matchLabels: null
      serviceMonitorSelector:
        matchLabels: null
EOF

helm template foo . \
    --values values.yaml \
    >z1
$ diff z0 z1 | grep matchLabels
<empty>

In other words, the matchLabels: null is skipped.

Instead:

cat >values.yaml <<EOF
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      ruleSelector:
        matchLabels: a
      serviceMonitorSelector:
        matchLabels: a
EOF

helm template foo . \
    --values values.yaml \
    >z1
$ diff z0 z1 | grep matchLabels -A2
-    matchLabels:
-      release: 'foo'
+    matchLabels: a
   serviceMonitorNamespaceSelector: 
     {}
--
-    matchLabels:
-      release: 'foo'
+    matchLabels: a
   scrapeConfigSelector: 
     matchLabels:
       release: 'foo'

Maybe due to helm/helm#12637?

@yellowhat Correct, we are aware that there is a Helm issue which adds an incompatibility with sub-charts. Ref: #4818

Could you try this:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      additionalConfig:
        scrapeConfigSelector:
          matchLabels:
            monitoring: prometheus

The helm template output may look goofy, but due to how duplicate keys in YAML work, the config from additionalConfig should be preferred.

Note that this may only be a short-term workaround. We are looking into a proper solution.

Hi guys,

I never thought that somebody would find the root cause and provide a workaround over the weekend. Big high five to all participants!

I agree with @yellowhat's recommendation to apply the changes. I have no clue how many users use the release label, but the idea was basically good and useful for Helm chart validation.

@jkroepke Good catch, but it will probably cause a "major" change. I also think it would be great to move these items there. We will see whether it can land in the next version or only in the next major release.

From what I can see, the default value is already empty:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      additionalConfig: {}

How can I remove the matchLabels option?

If scrapeConfigSelector is configured via additionalConfig, it should fully override the existing scrapeConfigSelector.
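A sketch of the rendered spec this workaround relies on: the selector appears twice in the manifest, and per the comment above the later occurrence (from additionalConfig) is the one that takes effect:

```yaml
spec:
  scrapeConfigSelector:        # default rendered by the chart
    matchLabels:
      release: 'foo'
  scrapeConfigSelector:        # appended from additionalConfig; takes precedence
    matchLabels:
      monitoring: prometheus
```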

@jkroepke the Helm issue with null values in subcharts that would apply in this case is afaik already fixed, and I was able to upgrade to this new version without any changes to my deployment:

https://github.com/broersma-forslund/homelab/pull/570/files
See diff in this comment: broersma-forslund/homelab#570 (comment)

I suspect users with issues are using older versions of helm.

May I ask which version are you using?

I am on 3.16.1.

Mhm, weird. I'm on v3.15.2. But it's clearly working for my scenario.

It works only if it's placed in the default values.yaml of the umbrella chart. I see in your diff that that's the case.

If the override comes from additional values, for example via -f, the issue still occurs.

As far as I know helm does not support overriding a non-null value with a null value in a subchart, e.g.

name: monitoring
version: "0.0.1"
appVersion: "0.1.0"
description: Monitoring platform
type: application
apiVersion: v2
dependencies:
  - name: kube-prometheus-stack
    version: 63.1.0
    repository: https://prometheus-community.github.io/helm-charts
kube-prometheus-stack:
  ## Manages Prometheus and Alertmanager components
  ##
  prometheusOperator:
  prometheus:
    prometheusSpec:
      ruleSelector:
        matchLabels: null # These nulls can never override a non-null value in values.yaml in https://prometheus-community.github.io/helm-charts
      serviceMonitorSelector:
        matchLabels: null 
      podMonitorSelector:
        matchLabels: null
      probeSelector:
        matchLabels: null
      scrapeConfigSelector:
        matchLabels: null

so with this setup it is impossible to override the value in e.g. https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml#L3708-L3710

See various helm issues helm/helm#12741 helm/helm#13222 helm/helm#12511
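The dropped-null behavior can be modeled with a simplified sketch (plain Python, purely illustrative, not Helm's actual merge code): if the merge skips null overrides, the subchart default can never be cleared.

```python
def merge(defaults, overrides):
    # Simplified model of a deep merge that ignores None overrides,
    # mimicking the behavior described in the Helm issues above.
    result = dict(defaults)
    for key, value in overrides.items():
        if value is None:
            continue  # null override dropped -> default survives
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

defaults = {"serviceMonitorSelector": {"matchLabels": {"release": "foo"}}}
overrides = {"serviceMonitorSelector": {"matchLabels": None}}
print(merge(defaults, overrides))
# -> {'serviceMonitorSelector': {'matchLabels': {'release': 'foo'}}}
```

With these semantics, matchLabels: null from the parent chart's extra values can never remove the chart's default release label.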

As far as I know helm does not support overriding a non-null value with a null value in a subchart, e.g.

As explained above, this does work for overrides placed in the parent chart's default values, but apparently not for values overrides provided as separate values to the parent chart.

Hi all,

this morning I discussed possible options with @QuentinBisson to recover from the failed v63 release. Both proposals will result in another breaking change. But it is what it is.

Proposal 1 - Switch to plain strings (Vote with 🎉)

All Selectors become plain strings.

Instead:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      scrapeConfigSelector:
        matchLabels:
          monitoring: prometheus

This format will be used:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      scrapeConfigSelector: |-
        matchLabels:
          monitoring: prometheus

This would avoid the need to set values to null while giving full flexibility to fill in values. An empty string means null in that case. If set, the default value "release: {{ $.Release.Name }}" would be overwritten without setting anything else to null.

The plain string would be passed through tpl as well.
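Under this proposal, disabling the selector entirely would then be a plain empty string (a sketch of the proposed format, not current chart behavior):

```yaml
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      scrapeConfigSelector: ""   # empty string would mean null, i.e. no selector
```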

On the template side, it would change from

{{- if not (kindIs "invalid" .Values.prometheus.prometheusSpec.ruleNamespaceSelector) }}
  ruleNamespaceSelector: {{ tpl (toYaml .Values.prometheus.prometheusSpec.ruleNamespaceSelector) . | nindent 4 }}
{{- end }}

to

{{- with .Values.prometheus.prometheusSpec.ruleNamespaceSelector }}
  ruleNamespaceSelector: {{ tpl . $ | nindent 4 }}
{{- end }}

Proposal 2 - Revert to old values (Vote with 🚀)

Basically revert #4818 without any further improvements.

How about introducing a variable that controls applying the default label (release: {{ .Release.Name }})?
That is my proposal. It does not break the simple YAML structure and is probably easier to read.

In my case it will ignore the release label:

Option A:

Add a new variable at the ruleConfigSelector level.

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      ruleConfigSelector:
        includeReleaseLabel: false   # Default: true; applies the 'release' label. Ignored when matchLabels is null.
        matchLabels:                 # Allows configuring additional labels, with or without the release label.
          monitoring: prometheus

The template:
This logic can be implemented in _helpers.tpl and just included.

{{- if not (kindIs "invalid" .Values.prometheus.prometheusSpec.ruleNamespaceSelector) }}
  ruleNamespaceSelector:
  {{- if .Values.prometheus.prometheusSpec.ruleNamespaceSelector.includeReleaseLabel }}
    release: {{ .Release.Name }}
  {{- end }}
  {{- with .Values.prometheus.prometheusSpec.ruleNamespaceSelector.matchLabels }}
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}

Option B:

The ruleConfigSelector part remains untouched and a new variable is added at the prometheusSpec level.

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      ruleConfigWithReleaseLabel: false   # Default: true; applies the 'release' label. Ignored when matchLabels is null.
      ruleConfigSelector:
        matchLabels:                      # Allows configuring additional labels, with or without the release label.
          monitoring: prometheus

The template:
This logic can be implemented in _helpers.tpl and just included.

{{- if not (kindIs "invalid" .Values.prometheus.prometheusSpec.ruleNamespaceSelector) }}
  ruleNamespaceSelector:
  {{- if .Values.prometheus.prometheusSpec.ruleConfigWithReleaseLabel }}
    release: {{ .Release.Name }}
  {{- end }}
  {{- with .Values.prometheus.prometheusSpec.ruleNamespaceSelector.matchLabels }}
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}

Same problem here. So basically this comment does not work for us:

## To remove matchLabels from the selector condition, explicitly set matchLabels to null.

Instead of having to set it to an empty string or null, I would much prefer something more functional, for example being able to set it to ALL to match all labels, or having a boolean property matchAllLabels or something similar. scrapeConfigSelectorNilUsesHelmValues was always a confusing property, and setting a selector to "" does not make it clear that this means match ALL instead of match NONE.
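For context, a sketch of the older pattern referred to here (the key name is taken from the comment above; check the chart's values.yaml for the exact semantics):

```yaml
prometheus:
  prometheusSpec:
    # When false, the selector is rendered so that all ScrapeConfigs are
    # matched regardless of the release label
    scrapeConfigSelectorNilUsesHelmValues: false
```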

It doesn't help that the Prometheus Operator CRD itself is also unusual in that it supports three states (nil, empty, and filled). I don't think the current kps chart supports all three states, does it?

Same here, using a special label to discover the targets; I had to revert back to 62 because everything disappeared 😃

@rouke-broersma What do you think about my proposal above? It can keep both options. The only issue is that you cannot use the two settings in an "OR" relation (keeping the default release label while also scraping custom resources with different label(s)).

Your proposal does not exactly match the actual scenarios. The actual scenario is not whether or not the release label is included. The actual scenario is one of these three:

  • Match NO resources (nil value in prometheus-operator)
  • Match a subset of resources based on labels (for example the default release label, or a custom label)
  • Match ALL resources (empty array/object in prometheus-operator)
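For illustration, the three states map onto the Prometheus CRD roughly like this (a sketch of the operator's selector semantics as described above):

```yaml
# Match NO resources: selector left unset (nil)
serviceMonitorSelector: null

# Match a subset: selector with labels
serviceMonitorSelector:
  matchLabels:
    release: foo

# Match ALL resources: empty selector object
serviceMonitorSelector: {}
```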

We decided to perform a rollback: #4883

My motivation for using some of the 'cool' features is quite low because there may be other unknown bugs in the Helm ecosystem. I want to avoid another failed release.