GoogleCloudPlatform/prometheus-engine

Incorrect metrics values displayed on Google Cloud Managed Service for Prometheus

Closed this issue · 2 comments

I'm not sure if this is the right place to report this, but I'd like to report an issue I've encountered with Managed Service for Prometheus: it displays the wrong value for certain metrics.

Reproduction steps

Create a GKE cluster with Autopilot mode:

gcloud container clusters create-auto test-cluster \
    --region asia-northeast1 \
    --project=nagasawa-test

Setup GKE authentication for kubectl:

gcloud container clusters get-credentials test-cluster --region asia-northeast1

Check if you can access the cluster:

kubectl get nodes
NAME                                          STATUS   ROLES    AGE     VERSION
gk3-test-cluster-default-pool-f186350b-d4s3   Ready    <none>   3m56s   v1.27.3-gke.100

Install kube-state-metrics using the community Helm chart (prometheus-community/kube-state-metrics), using the chart's extraManifests value to create a ClusterPodMonitoring resource so that managed collection scrapes it:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

cat <<EOF | helm upgrade kube-state-metrics prometheus-community/kube-state-metrics --version 5.11.0 --install -n gmp-public -f -
collectors:
  - pods
extraManifests:
- apiVersion: monitoring.googleapis.com/v1
  kind: ClusterPodMonitoring
  metadata:
    name: kube-state-metrics
    labels:
      app.kubernetes.io/name: kube-state-metrics
      app.kubernetes.io/part-of: google-cloud-managed-prometheus
  spec:
    selector:
      matchLabels:
        app.kubernetes.io/name: kube-state-metrics
    endpoints:
    - port: http
      interval: 30s
EOF
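
To sanity-check the install, you can verify that the ClusterPodMonitoring resource and the exporter exist. A minimal check, assuming managed collection is enabled (the default on Autopilot clusters), so the GMP CRDs are present:

# Confirm the ClusterPodMonitoring created via extraManifests exists.
kubectl get clusterpodmonitoring kube-state-metrics

# Confirm kube-state-metrics itself is running.
kubectl get pods -n gmp-public -l app.kubernetes.io/name=kube-state-metrics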

Some of the Managed Prometheus pods always restart on boot (see GoogleCloudPlatform/prometheus-engine#472), which conveniently gives us non-zero restart counts to compare:

kubectl get pods -n gke-gmp-system
NAME                             READY   STATUS    RESTARTS        AGE
alertmanager-0                   2/2     Running   2 (5m10s ago)   5m34s
collector-8smnv                  2/2     Running   1 (6m2s ago)    7m24s
gmp-operator-695bbf877f-zjrbq    1/1     Running   0               5m42s
rule-evaluator-567694d5f-vhrg6   2/2     Running   2 (5m20s ago)   5m42s

Run the following query to see the restart counts of the pods. (Note that the metric's original namespace label surfaces as exported_namespace, since the namespace label is reserved for the namespace of the scrape target itself, here gmp-public.)

kube_pod_container_status_restarts_total{cluster="test-cluster", exported_namespace="gke-gmp-system"}

The Managed Prometheus console shows incorrect restart counts for some of the pods; in this case, for collector-8smnv and rule-evaluator-567694d5f-vhrg6. kubectl reports a restart count of 1 for collector-8smnv, but the Managed Prometheus console shows 0.

[Screenshot: Managed Prometheus console showing a restart count of 0 for collector-8smnv]

Looking at the implementation of kube-state-metrics, it simply fetches the restart count from the Pod's .status.containerStatuses[*].restartCount field. (source)

Reading the field directly confirms restart counts of 1 and 2 for the config-reloader containers of these two pods:

❯ kubectl get pods -n gke-gmp-system collector-8smnv -ojsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
config-reloader	1
prometheus	0

❯ kubectl get pods -n gke-gmp-system rule-evaluator-567694d5f-vhrg6 -ojsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
config-reloader	2
evaluator	0

Get the raw metrics from the `/metrics` endpoint of kube-state-metrics:

kubectl port-forward svc/kube-state-metrics -n gmp-public 8080:8080

Run the curl command and check that it reports the correct value for the kube_pod_container_status_restarts_total metric:

❯ curl -sSL http://localhost:8080/metrics | grep kube_pod_container_status_restarts_total | grep gke-gmp-system
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="gmp-operator-695bbf877f-zjrbq",uid="dea47b33-180a-4f3d-80b6-23783199d980",container="operator"} 0
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="collector-8smnv",uid="7adcd840-6d31-48d1-bf7d-b749cfa5f3e5",container="config-reloader"} 1
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="collector-8smnv",uid="7adcd840-6d31-48d1-bf7d-b749cfa5f3e5",container="prometheus"} 0
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="rule-evaluator-567694d5f-vhrg6",uid="d9a9d842-3a42-43bd-8530-e7b7dfa9cdd1",container="config-reloader"} 2
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="rule-evaluator-567694d5f-vhrg6",uid="d9a9d842-3a42-43bd-8530-e7b7dfa9cdd1",container="evaluator"} 0
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="alertmanager-0",uid="719171c5-0ef6-417e-85a4-44f2bec73c3f",container="alertmanager"} 0
kube_pod_container_status_restarts_total{namespace="gke-gmp-system",pod="alertmanager-0",uid="719171c5-0ef6-417e-85a4-44f2bec73c3f",container="config-reloader"} 2

So kube-state-metrics reports the correct values, but Managed Prometheus displays incorrect ones.
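
(To rule out a console rendering issue, the same series can also be fetched through the Prometheus-compatible HTTP query API of Managed Service for Prometheus. A rough sketch, assuming gcloud is authenticated for the nagasawa-test project:)

# Query the ingested series via the Prometheus-compatible HTTP API.
curl -sS \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'query=kube_pod_container_status_restarts_total{cluster="test-cluster", exported_namespace="gke-gmp-system"}' \
  'https://monitoring.googleapis.com/v1/projects/nagasawa-test/location/global/prometheus/api/v1/query'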

@toVersus Thank you for your detailed issue report. This appears to be consistent with expected behavior.
https://cloud.google.com/stackdriver/docs/managed-prometheus/troubleshooting#counter-sums

The raw values of counters are offset because the first ingested point of each time series is skipped.
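
In other words, only the raw sample values carry the offset; windowed functions such as rate() and increase() difference it away. A sketch of the kind of query the linked guide recommends instead of reading raw counter values:

# Restarts over the last day; unaffected by the skipped first point.
increase(kube_pod_container_status_restarts_total{cluster="test-cluster", exported_namespace="gke-gmp-system"}[1d])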

Ah, sorry for asking this question without looking at the troubleshooting guide.
Thanks for the pointer! I'll close the issue.