[kube-prometheus-stack] PrometheusDuplicateTimestamps alerts firing after upgrade to 57.2.1
lefterisALEX opened this issue · 2 comments
Describe the bug (a clear and concise description of what the bug is).
After upgrading kube-prometheus-stack from 56.21.4 to 57.2.1, PrometheusDuplicateTimestamps alerts started firing.
What's your helm version?
v3.15.2
What's your kubectl version?
1.30.1
Which chart?
kube-prometheus-stack
What's the chart version?
57.2.1
What happened?
After upgrading the kube-prometheus-stack chart from 56.21.4 to 57.2.1, we started receiving PrometheusDuplicateTimestamps alerts. In the logs of the Prometheus pods I see:
stern prometheus-kube-prometheus-stack-prometheus-0 --tail 10 -i "scrape manager"
+ prometheus-kube-prometheus-stack-prometheus-0 › prometheus
+ prometheus-kube-prometheus-stack-prometheus-0 › thanos-sidecar
+ prometheus-kube-prometheus-stack-prometheus-0 › config-reloader
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:08:56.277Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:09:26.289Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:09:56.250Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:10:26.256Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:10:56.313Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:11:26.255Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:11:56.339Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:12:26.204Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:12:56.232Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
prometheus-kube-prometheus-stack-prometheus-0 prometheus ts=2024-06-25T08:13:26.334Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/kube-system/kube-prometheus-stack-kube-state-metrics/0 target=http://100.81.77.2:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=76
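For reference, the chart's PrometheusDuplicateTimestamps rule fires on the `prometheus_target_scrapes_sample_duplicate_timestamp_total` counter, so a query along these lines (a sketch, run against Prometheus itself) shows the drop rate the alert is reacting to; the offending scrape pool itself has to be read from the warning logs above:

```
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
```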
What you expected to happen?
PrometheusDuplicateTimestamps alerts should not fire.
How to reproduce it?
Install kube-prometheus-stack with the provided values.yaml.
Enter the changed values of values.yaml?
USER-SUPPLIED VALUES:
alertmanager:
  alertmanagerSpec:
    alertmanagerConfigNamespaceSelector: {}
    alertmanagerConfigSelector:
      matchLabels:
        xxxxxx-system: "true"
    externalUrl: https://alertmanager.xxxxxx-dev1.np.aws.company.yyy
    podAntiAffinity: hard
    podAntiAffinityTopologyKey: topology.kubernetes.io/zone
    replicas: 3
    resources:
      requests:
        cpu: 20m
        memory: 64Mi
    retention: 2h
    storage:
      volumeClaimTemplate:
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 8Gi
  config:
    global:
      resolve_timeout: 10m
    inhibit_rules:
    - equal:
      - namespace
      - alertname
      source_matchers:
      - severity = "critical"
      target_matchers:
      - severity =~ "warning|info"
    - equal:
      - namespace
      - alertname
      source_matchers:
      - severity = "warning"
      target_matchers:
      - severity = "info"
    - equal:
      - namespace
      source_matchers:
      - alertname = "InfoInhibitor"
      target_matchers:
      - severity = "info"
    receivers:
    - name: blackhole
    - name: slack_default
      slack_configs:
      - actions:
        - text: "Runbook :green_book:"
          type: button
          url: '{{ template "slack.monzo.runbook" . }}'
        - text: "Query :mag:"
          type: button
          url: "{{ (index .Alerts 0).GeneratorURL }}"
        - text: "Dashboard :grafana:"
          type: button
          url: "{{ (index .Alerts 0).Annotations.dashboard }}"
        - text: "Silence :no_bell:"
          type: button
          url: '{{ template "__alert_silence_link" . }}'
        - text: '{{ template "slack.monzo.link_button_text" . }}'
          type: button
          url: "{{ .CommonAnnotations.link_url }}"
        api_url: https://hooks.slack.com/services/xxxx
        color: '{{ template "slack.monzo.color" . }}'
        icon_emoji: '{{ template "slack.monzo.icon_emoji" . }}'
        send_resolved: true
        text: '{{ template "slack.monzo.text" . }}'
        title: '[xxxxxx-dev1] {{ template "slack.monzo.title" . }}'
    route:
      group_by:
      - job
      group_interval: 5m
      group_wait: 30s
      receiver: blackhole
      repeat_interval: 1h
      routes:
      - matchers:
        - alertname = "InfoInhibitor"
        receiver: blackhole
      - matchers:
        - alertname = "Watchdog"
        receiver: blackhole
      - continue: true
        matchers:
        - severity =~ "^(critical|warning)$"
        receiver: slack_default
  ingress:
    enabled: true
    hosts:
    - alertmanager.xxxxxx-dev1.np.aws.company.yyy
    pathType: ImplementationSpecific
    paths:
    - /
  podDisruptionBudget:
    enabled: true
    minAvailable: 2
  templateFiles:
    monzo-slack-templates.tmpl: |-
      # This builds the silence URL. We exclude the alertname in the range
      # to avoid having a trailing comma separator (%2C) at the end
      # of the generated URL
      {{ define "__alert_silence_link" -}}
      {{ .ExternalURL }}/#/silences/new?filter=%7B
      {{- range .CommonLabels.SortedPairs -}}
      {{- if ne .Name "alertname" -}}
      {{- .Name }}%3D"{{- .Value -}}"%2C%20
      {{- end -}}
      {{- end -}}
      alertname%3D"{{ .CommonLabels.alertname }}"%7D
      {{- end }}
      {{ define "__alert_severity_prefix_title" -}}
      {{ if ne .Status "firing" -}}
      :white_check_mark:
      {{- else if eq .CommonLabels.severity "critical" -}}
      :fire:
      {{- else if eq .CommonLabels.severity "warning" -}}
      :warning:
      {{- else if eq .CommonLabels.severity "info" -}}
      :information_source:
      {{- else -}}
      :question:
      {{- end }}
      {{- end }}
      {{/* First line of Slack alerts */}}
      {{ define "slack.monzo.title" -}}
      [{{ .Status | toUpper -}}
      {{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{- end -}}
      ] {{ template "__alert_severity_prefix_title" . }} {{ .CommonLabels.alertname }}
      {{- end }}
      {{/* Color of Slack attachment (appears as line next to alert) */}}
      {{ define "slack.monzo.color" -}}
      {{ if eq .Status "firing" -}}
      {{ if eq .CommonLabels.severity "warning" -}}
      warning
      {{- else if eq .CommonLabels.severity "critical" -}}
      danger
      {{- else -}}
      #439FE0
      {{- end -}}
      {{ else -}}
      good
      {{- end }}
      {{- end }}
      {{/* Emoji to display as user icon (custom emoji supported!) */}}
      {{ define "slack.monzo.icon_emoji" }}:prometheus:{{ end }}
      {{/* The text to display in the alert */}}
      {{ define "slack.monzo.text" -}}
      {{ range .Alerts }}
      {{- if .Annotations.message }}
      {{ .Annotations.message }}
      {{- end }}
      {{- if .Annotations.description }}
      {{ .Annotations.description }}
      {{- end }}
      {{- end }}
      {{- end }}
      {{ define "slack.monzo.link_button_text" -}}
      {{- if .CommonAnnotations.link_text -}}
      {{- .CommonAnnotations.link_text -}}
      {{- else -}}
      Link
      {{- end }} :link:
      {{- end }}
      {{ define "slack.monzo.runbook" -}}
      {{- if (index .Alerts 0).Annotations.runbook -}}
      {{- (index .Alerts 0).Annotations.runbook -}}
      {{- else -}}
      {{- (index .Alerts 0).Annotations.runbook_url -}}
      {{- end }}
      {{- end }}
commonLabels:
  xxxxxx-system: "true"
customRules:
  NodeDiskIOSaturation:
    for: 15m
    severity: warning
defaultRules:
  appNamespacesTarget: kube-system|cattle-system|kyverno|keda|kubecost|ingress-nginx|pod-security-webhook|qualys
  create: true
  rules:
    etcd: false
    kubeControllerManager: false
    kubeProxy: false
    kubeSchedulerAlerting: false
    kubeSchedulerRecording: false
    kubernetesResources: false
grafana:
  enabled: false
kube-state-metrics:
  collectors:
  - certificatesigningrequests
  - configmaps
  - cronjobs
  - daemonsets
  - deployments
  - endpoints
  - horizontalpodautoscalers
  - ingresses
  - jobs
  - limitranges
  - mutatingwebhookconfigurations
  - namespaces
  - networkpolicies
  - nodes
  - persistentvolumeclaims
  - persistentvolumes
  - poddisruptionbudgets
  - pods
  - replicasets
  - replicationcontrollers
  - resourcequotas
  - secrets
  - services
  - statefulsets
  - storageclasses
  - validatingwebhookconfigurations
  - volumeattachments
  metricLabelsAllowlist:
  - nodes=[*]
  - namespaces=[*]
  - pods=[app,instance,component,app.kubernetes.io/name,app.kubernetes.io/instance,app.kubernetes.io/component]
  prometheus:
    monitor:
      additionalLabels:
        xxxxxx-system: "true"
      relabelings:
      - action: replace
        sourceLabels:
        - __meta_kubernetes_pod_node_name
        targetLabel: kubernetes_node
kubeControllerManager:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
kubeScheduler:
  enabled: false
kubelet:
  enabled: true
  serviceMonitor:
    cAdvisorMetricRelabelings:
    - action: drop
      regex: container_memory_failures_total
      sourceLabels:
      - __name__
    - action: drop
      regex: container_cpu_(cfs_throttled_seconds_total|load_average_10s|system_seconds_total|user_seconds_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: container_memory_(mapped_file|swap)
      sourceLabels:
      - __name__
    - action: drop
      regex: container_(file_descriptors|tasks_state|threads_max)
      sourceLabels:
      - __name__
    - action: drop
      regex: container_spec.*
      sourceLabels:
      - __name__
    - action: drop
      regex: .+;
      sourceLabels:
      - id
      - pod
prometheus:
  ingress:
    enabled: true
    hosts:
    - prometheus.xxxxxx-dev1.np.aws.company.yyy
    pathType: ImplementationSpecific
    paths:
    - /
  podDisruptionBudget:
    enabled: true
    minAvailable: 2
  prometheusSpec:
    disableCompaction: true
    externalLabels:
      cluster: xxxxxx-dev1
    externalUrl: https://prometheus.xxxxxx-dev1.np.aws.company.yyy
    nodeSelector:
      company.yyy/role: system
    podAntiAffinity: hard
    podAntiAffinityTopologyKey: topology.kubernetes.io/zone
    podMonitorNamespaceSelector: {}
    podMonitorSelector:
      matchLabels:
        xxxxxx-system: "true"
    podMonitorSelectorNilUsesHelmValues: false
    probeNamespaceSelector: {}
    probeSelector:
      matchLabels:
        xxxxxx-system: "true"
    probeSelectorNilUsesHelmValues: false
    replicas: 3
    resources:
      requests:
        cpu: 500m
        memory: 5000M
    retention: 2h
    ruleNamespaceSelector: {}
    ruleSelector:
      matchLabels:
        xxxxxx-system: "true"
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorNamespaceSelector: {}
    serviceMonitorSelector:
      matchLabels:
        xxxxxx-system: "true"
    serviceMonitorSelectorNilUsesHelmValues: false
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 32Gi
    thanos:
      image: quay.io/thanos/thanos:v0.35.1
      objectStorageConfig:
        existingSecret:
          key: config
          name: thanos-sidecar-storage-config
    tolerations:
    - key: node.kubernetes.io/system
      operator: Exists
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::1265453742:role/xxxxxx-monitoring-thanos-xxxx-dev1
prometheus-node-exporter:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
          - key: eks.amazonaws.com/compute-type
            operator: NotIn
            values:
            - fargate
  extraArgs:
  - --collector.ethtool
  - --collector.ethtool.device-include=^eth.*$
  - --collector.ethtool.metrics-include=^.*_exceeded$
  priorityClassName: system-node-critical
  prometheus:
    monitor:
      additionalLabels:
        xxxxxx-system: "true"
  resources:
    requests:
      cpu: 20m
      memory: 30Mi
  tolerations:
  - operator: Exists
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 100%
    type: RollingUpdate
prometheusOperator:
  admissionWebhooks:
    patch:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
  resources:
    requests:
      cpu: 30m
      memory: 256Mi
Enter the command that you execute that is failing/misfunctioning.
We use Terraform:
resource "helm_release" "kube_prometheus_stack" {
  count       = local.enable_prometheus ? 1 : 0
  name        = "kube-prometheus-stack"
  chart       = "kube-prometheus-stack"
  version     = "57.2.1"
  namespace   = "kube-system"
  repository  = "https://prometheus-community.github.io/helm-charts"
  max_history = 5
  values = [
    ....
  ]
}
Anything else we need to know?
no
That's a bug in kube-state-metrics (ksm); see PR kubernetes/kube-state-metrics#2257.
We've also had this issue after upgrading kube-prometheus-stack, and for us it turned out to be a bug in an Ingress resource that created duplicate metrics. For some reason we didn't notice the problem with earlier versions of kube-prometheus-stack.
You could check whether you have a similar issue by creating a port-forward to the kube-state-metrics service, downloading the metrics from /metrics, and checking the file for duplicates. Looking at your deployment, it should go something like this:
kubectl -n kube-system port-forward svc/kube-prometheus-stack-kube-state-metrics 8080
curl -O http://localhost:8080/metrics
cat metrics | sort > metrics_sorted
cat metrics_sorted | uniq > metrics_uniq
diff metrics_sorted metrics_uniq
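The same check can be collapsed into one pipeline with `uniq -d`, which prints each duplicated line once. The sample file below is hypothetical and just stands in for the metrics dump fetched by the curl above:

```shell
# Hypothetical stand-in for the real /metrics dump (note the duplicated
# kube_ingress_info series, the kind of duplicate an Ingress bug produces).
cat > /tmp/metrics <<'EOF'
kube_ingress_info{namespace="default",ingress="web"} 1
kube_ingress_info{namespace="default",ingress="web"} 1
kube_pod_info{namespace="default",pod="api-0"} 1
EOF

# Skip HELP/TYPE comment lines, sort, and print each duplicated line once.
grep -v '^#' /tmp/metrics | sort | uniq -d
# -> kube_ingress_info{namespace="default",ingress="web"} 1
```

Any output at all from `uniq -d` means the target is exposing the same series twice, which is exactly what Prometheus then drops as "different value but same timestamp".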
Best,
Max