Incorrect metric values displayed on Google Cloud Managed Service for Prometheus
I'm not sure if this is the right place to report this, but I'd like to report an issue I've encountered with Managed Prometheus: it displays incorrect values for certain metrics.
Reproduction steps
Create a GKE cluster with Autopilot mode:
gcloud container clusters create-auto test-cluster \
--region asia-northeast1 \
--project=nagasawa-test
Setup GKE authentication for kubectl:
gcloud container clusters get-credentials test-cluster --region asia-northeast1
Check if you can access the cluster:
❯ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gk3-test-cluster-default-pool-f186350b-d4s3 Ready <none> 3m56s v1.27.3-gke.100
Install kube-state-metrics using the community Helm chart (prometheus-community/kube-state-metrics):
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
cat <<EOF | helm upgrade kube-state-metrics prometheus-community/kube-state-metrics --version 5.11.0 --install -n gmp-public -f -
collectors:
  - pods
extraManifests:
  - apiVersion: monitoring.googleapis.com/v1
    kind: ClusterPodMonitoring
    metadata:
      name: kube-state-metrics
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/part-of: google-cloud-managed-prometheus
    spec:
      selector:
        matchLabels:
          app.kubernetes.io/name: kube-state-metrics
      endpoints:
        - port: http
          interval: 30s
EOF
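Before scraping, it may help to confirm that both the chart and the scrape config landed (a quick sanity check; the short resource name clusterpodmonitoring assumes the Managed Prometheus CRDs are present, which they are by default on GKE with managed collection):
kubectl get pods -n gmp-public -l app.kubernetes.io/name=kube-state-metrics
kubectl get clusterpodmonitoring kube-state-metrics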
Some of the Managed Prometheus Pods always restart on boot (see GoogleCloudPlatform/prometheus-engine#472):
❯ kubectl get pods -n gke-gmp-system
NAME READY STATUS RESTARTS AGE
alertmanager-0 2/2 Running 2 (5m10s ago) 5m34s
collector-8smnv 2/2 Running 1 (6m2s ago) 7m24s
gmp-operator-695bbf877f-zjrbq 1/1 Running 0 5m42s
rule-evaluator-567694d5f-vhrg6 2/2 Running 2 (5m20s ago) 5m42s
Run the following query to see the restart count of Pods:
kube_pod_container_status_restarts_total{cluster="test-cluster", exported_namespace="gke-gmp-system"}
The Managed Prometheus console shows incorrect restart counts for some Pods, in this case for collector-8smnv and rule-evaluator-567694d5f-vhrg6. Using kubectl, collector-8smnv shows a restart count of 1, but the Managed Prometheus console shows 0.
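Note that kubectl's RESTARTS column is the sum over all containers in a Pod, so the per-Pod aggregation that should line up with it is something like the following sketch, reusing the labels from the query above:
sum by (pod) (kube_pod_container_status_restarts_total{cluster="test-cluster", exported_namespace="gke-gmp-system"})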
Looking at the implementation of kube-state-metrics, it simply reads the restart count from the Pod's .status.containerStatuses[*].restartCount field (source). And indeed, the config-reloader containers in these two Pods report restart counts of 1 and 2 respectively:
❯ kubectl get pods -n gke-gmp-system collector-8smnv -ojsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
config-reloader 1
prometheus 0
❯ kubectl get pods -n gke-gmp-system rule-evaluator-567694d5f-vhrg6 -ojsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
config-reloader 2
evaluator 0
Get raw metrics from the /metrics endpoint of kube-state-metrics:
kubectl port-forward svc/kube-state-metrics -n gmp-public 8080:8080
Run the curl command below and check that it reports the correct values for the kube_pod_container_status_restarts_total metric:
❯ curl -sSL http://localhost:8080/metrics | grep kube_pod_container_status_restarts_total | grep gke-gmp-system
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="gmp-operator-695bbf877f-zjrbq",uid="dea47b33-180a-4f3d-80b6-23783199d980",container="operator"} 0
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="collector-8smnv",uid="7adcd840-6d31-48d1-bf7d-b749cfa5f3e5",container="config-reloader"} 1
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="collector-8smnv",uid="7adcd840-6d31-48d1-bf7d-b749cfa5f3e5",container="prometheus"} 0
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="rule-evaluator-567694d5f-vhrg6",uid="d9a9d842-3a42-43bd-8530-e7b7dfa9cdd1",container="config-reloader"} 2
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="rule-evaluator-567694d5f-vhrg6",uid="d9a9d842-3a42-43bd-8530-e7b7dfa9cdd1",container="evaluator"} 0
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="alertmanager-0",uid="719171c5-0ef6-417e-85a4-44f2bec73c3f",container="alertmanager"} 0
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="alertmanager-0",uid="719171c5-0ef6-417e-85a4-44f2bec73c3f",container="config-reloader"} 2
It appears that kube-state-metrics reports the correct values, but Managed Prometheus displays incorrect ones.
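To rule out the console UI itself, the same query can also be sent directly to the Managed Service for Prometheus HTTP API (a sketch; PROJECT_ID is a placeholder, and this assumes the endpoint accepts the standard Prometheus form-encoded query like the upstream API does):
curl -sS \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  'https://monitoring.googleapis.com/v1/projects/PROJECT_ID/location/global/prometheus/api/v1/query' \
  --data-urlencode 'query=kube_pod_container_status_restarts_total{cluster="test-cluster", exported_namespace="gke-gmp-system"}'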
@toVersus Thank you for your detailed issue report. This appears to be consistent with expected behavior.
https://cloud.google.com/stackdriver/docs/managed-prometheus/troubleshooting#counter-sums
The raw values of counters are offset because the first ingested point of each time series is skipped.
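In practice this means the raw counter value can lag what kubectl reports, while windowed functions are unaffected by the skipped initial sample; for example, a query along these lines (same labels as above) should agree between the two:
increase(kube_pod_container_status_restarts_total{cluster="test-cluster", exported_namespace="gke-gmp-system"}[1h])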
Ah, sorry for asking this question without looking at the troubleshooting guide.
Thanks for the pointer! I'll close the issue.