jthomperoo/predictive-horizontal-pod-autoscaler

Unable to get metrics for resource CPU metrics on EKS

Closed this issue · 12 comments

Hi,

Unfortunately I'm really struggling to work out why this won't pick up metrics for my service. Even with logVerbosity: 3 I can't get any useful logs out. Any idea what I'm doing wrong?

I'm on Amazon EKS with K8S Version v1.16.8-eks-e16311 and Metrics Server v0.3.7.

I've verified it isn't permissions. Binding cluster-admin to the scaler pod doesn't seem to help and I get a different error when permissions are missing.
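For reference, a permissions check like this can be done with `kubectl auth can-i` (the namespace and service account name here are assumptions based on the manifests below):

```shell
# Can the scaler's service account read pod metrics from the metrics API?
kubectl auth can-i get pods.metrics.k8s.io \
  --as=system:serviceaccount:default:content-repo-cache-scaler

# Can it update the target Deployment's scale subresource?
kubectl auth can-i update deployments.apps --subresource=scale \
  --as=system:serviceaccount:default:content-repo-cache-scaler
```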

Logs:

I0825 10:58:20.014351      15 metric.go:76] Gathering metrics in per-resource mode
I0825 10:58:20.016279      15 metric.go:94] Attempting to run metric gathering logic
I0825 10:58:20.057419      15 shell.go:80] Shell command failed, stderr: 2020/08/25 10:58:20 invalid metrics (1 invalid out of 1), first error is: failed to get resource metric: unable to get metrics for resource cpu: no metrics returned from resource metrics API
E0825 10:58:20.057450      15 main.go:248] exit status 1

Metrics server is working because kubectl top works:

kubectl top pod | grep content-repo-cache
content-repo-cache-566c695fc8-d6zjr                               4m           912Mi           
content-repo-cache-566c695fc8-dg5jj                               40m          954Mi           
content-repo-cache-scaler                                         2m           7Mi    

Here's my YAML:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: content-repo-cache-scaler
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - replicationcontrollers
  - replicationcontrollers/scale
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - deployments
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  - statefulsets/scale
  verbs:
  - '*'
- apiGroups:
  - metrics.k8s.io
  resources:
  - '*'
  verbs:
  - '*'
---
apiVersion: custompodautoscaler.com/v1
kind: CustomPodAutoscaler
metadata:
  name: content-repo-cache-scaler
spec:
  template:
    spec:
      containers:
      - name: content-repo-cache-scaler
        image: jthomperoo/predictive-horizontal-pod-autoscaler:v0.5.0
        imagePullPolicy: IfNotPresent
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: content-repo-cache
  provisionRole: true
  config:
    - name: minReplicas
      value: "3"
    - name: maxReplicas
      value: "32"
    - name: logVerbosity
      value: "3"
    - name: predictiveConfig
      value: |
        models:
        - type: HoltWinters
          name: HoltWintersPrediction
          perInterval: 1
          holtWinters:
            alpha: 0.9
            beta: 0.9
            gamma: 0.9
            seasonLength: 4320
            storedSeasons: 4
            method: "additive"
        decisionType: "maximum"
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
    - name: interval
      value: "20000"
    - name: startTime
      value: "60000"
    - name: downscaleStabilization
      value: "600"

Hi, thanks very much for pointing this out.

Instead of provisionRole: true in the CustomPodAutoscaler definition, it should be provisionRole: false.

So your YAML should be:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: content-repo-cache-scaler
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - replicationcontrollers
  - replicationcontrollers/scale
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - deployments
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  - statefulsets/scale
  verbs:
  - '*'
- apiGroups:
  - metrics.k8s.io
  resources:
  - '*'
  verbs:
  - '*'
---
apiVersion: custompodautoscaler.com/v1
kind: CustomPodAutoscaler
metadata:
  name: content-repo-cache-scaler
spec:
  template:
    spec:
      containers:
      - name: content-repo-cache-scaler
        image: jthomperoo/predictive-horizontal-pod-autoscaler:v0.5.0
        imagePullPolicy: IfNotPresent
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: content-repo-cache 
  provisionRole: false
  config:
    - name: minReplicas
      value: "3"
    - name: maxReplicas
      value: "32"
    - name: logVerbosity
      value: "3"
    - name: predictiveConfig
      value: |
        models:
        - type: HoltWinters
          name: HoltWintersPrediction
          perInterval: 1
          holtWinters:
            alpha: 0.9
            beta: 0.9
            gamma: 0.9
            seasonLength: 4320
            storedSeasons: 4
            method: "additive"
        decisionType: "maximum"
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
    - name: interval
      value: "20000"
    - name: startTime
      value: "60000"
    - name: downscaleStabilization
      value: "600"

This seems to work on my cluster with that updated, running on EKS v1.16.13-eks-2ba888 with metrics server v0.3.7.

https://github.com/jthomperoo/custom-pod-autoscaler-operator/blob/master/USAGE.md#using-custom-resources

This guide outlines how the custom resources feature works in case you're interested.

Could you give that a go and see if that fixes it?

Thanks again, and really sorry if this took up a lot of your time; I'm sure this was a real headache!

Not at all. Thanks for being so helpful. Unfortunately the provisionRole: true was an error in my original post; it was just something I was trying to see if it made a difference.

I've just deployed this example (https://github.com/jthomperoo/predictive-horizontal-pod-autoscaler/tree/master/example/simple-linear) in a fresh namespace and have the same problem. I wonder if it is somehow an incompatibility between the latest operator and this older codebase. I'm going to go and try out the Python example from the custom pod autoscaler repo and 🤞

Ok, the simple linear example now works in my cluster after scaling metrics-server down from 2 replicas to 1. It doesn't seem to like running two replicas. Still got problems with my actual example, but one step closer!
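For anyone following along, the scale-down was done with something like this (the namespace and deployment name are assumptions; they vary by install):

```shell
# Scale metrics-server down to a single replica
kubectl -n kube-system scale deployment metrics-server --replicas=1
```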

Ok, so the example php-apache Deployment in my actual namespace works fine, but I get the invalid metrics error for any of my actual services. I can only assume something in the logic doesn't like something about my Pods, but I can't work out what that would be, and I'm not sure how to add logging to debug with the STDIN/STDOUT setup of the application. Any ideas about how to further debug it?

I've attached the Pod YAML for one of my services. Maybe something will jump out at you. 😕

Thanks again for your help with this

pod.yaml.txt

Hmm OK,

The error:

no metrics returned from resource metrics API

is from the module:

k8s.io/kubernetes/pkg/controller/podautoscaler/metrics

specifically the MetricsClient struct from the K8s codebase.

This is the same code that the normal Horizontal Pod Autoscaler uses, so I'm not really sure why this would be happening. Could you try deploying a HorizontalPodAutoscaler and checking that it is accessing the metrics successfully?

Maybe you could also check the metrics server logs, just to check there aren't any issues with certs or anything (with kubectl logs <metrics server pod> -n <metrics server namespace>).
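To expand on that slightly, these checks can confirm the metrics API itself is healthy (the kube-system namespace and metrics-server name are assumptions; they depend on how it was installed):

```shell
# The APIService should show AVAILABLE = True
kubectl get apiservice v1beta1.metrics.k8s.io

# Tail the metrics server logs for cert or scrape errors
kubectl -n kube-system logs deploy/metrics-server
```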

Here's an example HPA you could deploy:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: content-repo-cache-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: content-repo-cache 
  minReplicas: 3
  maxReplicas: 32
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
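Once that HPA is deployed, its status and events should show whether the metric lookup is succeeding, for example:

```shell
# Shows current metric values, conditions, and any FailedGetResourceMetric events
kubectl describe hpa content-repo-cache-scaler
```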

You raise some really good points; this project is really hard to debug. I'll add in extra logging and verbosity options for the underlying CPA Horizontal Pod Autoscaler and PHPA, which will hopefully help out a bit.

I'm going to continue playing around with this to see if I can recreate it. It seems strange that multiple metrics server replicas cause it to not work; maybe a sign of an underlying issue.

Oh! Just looking at your Pod YAML, are you using a ReplicaSet? If so, in your PHPA definition you are targeting:

  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: content-repo-cache

This should be:

  scaleTargetRef:
    apiVersion: apps/v1
    kind: ReplicaSet
    name: content-repo-cache

Just a thought, as from your Pod YAML there is:

ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: content-repo-cache-5cdd7d9cdb
    uid: 8696206b-77ea-4969-8d78-a9d29368afa0
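One way to confirm what actually owns the Pod (and whether that ReplicaSet is in turn owned by a Deployment) is to walk the ownerReferences; the pod name suffix below is a placeholder:

```shell
# What owns the Pod? Expect ReplicaSet/<name> for a Deployment-managed Pod
kubectl get pod content-repo-cache-5cdd7d9cdb-<suffix> \
  -o jsonpath='{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}'

# If this prints Deployment/<name>, target the Deployment; if it prints
# nothing, the ReplicaSet is standalone and should be targeted directly
kubectl get replicaset content-repo-cache-5cdd7d9cdb \
  -o jsonpath='{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}'
```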

There hasn't been any activity on this for a while, so I'm gonna close this issue. If it still isn't resolved feel free to reopen it!

Hi, first let me say that I think this is a great project and I'm interested to see where it goes! Unfortunately, I've been seeing the same problem described here with both your simple-linear-example and simple-holt-winters-example. Metrics server is working fine: our regular HPAs are able to collect metrics, and I can query for metrics with kubectl as well and see cpu/memory metrics.

But I'm also seeing the same error message as @cablespaghetti :

invalid metrics (1 invalid out of 1), first error is: failed to get resource metric: unable to get metrics for resource cpu: no metrics returned from resource metrics API

I tested this both with your Getting Started Guide and with some of my own deployments, and saw the same behavior.

I tracked it down to one of your internal modules: https://github.com/jthomperoo/horizontal-pod-autoscaler/blob/master/metric/gatherer.go#L137, but that's about as far as I got.

EKS K8s version: 1.19
Metrics Server: https://artifacthub.io/packages/helm/bitnami/metrics-server/5.8.9

I wonder if the issue may somehow be related to EKS, as it looks like both cablespaghetti and I are using it. Here are the metrics for the php-apache pod from your Getting Started example:

❯ kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/php-apache-d4cf67d68-959mh | jq
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "php-apache-d4cf67d68-959mh",
    "namespace": "default",
    "creationTimestamp": "2021-08-06T15:18:55Z",
    "labels": {
      "pod-template-hash": "d4cf67d68",
      "run": "php-apache"
    }
  },
  "timestamp": "2021-08-06T15:18:01Z",
  "window": "20s",
  "containers": [
    {
      "name": "php-apache",
      "usage": {
        "cpu": "2742606n",
        "memory": "10748Ki"
      }
    }
  ]
}

Anyway, cool project and hope that's helpful. Thanks and good luck!

Hi, first of all let me say this is a terrific project. I am running into the same issues as @garrett-EY and @cablespaghetti, where deploying CPA with custom deployment workloads and the k8s metrics-server returns this error message:

"main.go:277] invalid metrics (1 invalid out of 1), first error is: failed to get resource metric: unable to get metrics for resource cpu: no metrics returned from resource metrics API"

The example deployment file for k8s-metrics-cpu works as expected. I did a little experiment and worked through the deployment files included in the other examples, just to check that everything else was working. I had to add a "resources" spec to the deployment file for http-request since CPA was complaining about "missing cpu requests". I did that and tried launching the app, but CPA still complained even though the pod had resource specs set.

I compared the k8s-metrics-cpu and modified http-request deployment files, and noticed that the latter was missing the "labels" data under the metadata field on line 3, so I added one with an "app" field under it. Lo and behold, after doing that the errors went away and CPA started churning out informational messages about no changes in target replicas.

Could it be that differences in fields in the deployment file are causing the error?

Here are the two versions of the deployment file I tested. The first is the modified http-request deployment file, to which I added the resources:requests field but which still did not work. The second is the one that works with CPA.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-kubernetes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-kubernetes
  template:
    metadata:
      labels:
        app: hello-kubernetes
    spec:
      containers:
      - name: hello-kubernetes
        image: paulbouwer/hello-kubernetes:1.5
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "200m"

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: hello-kubernetes
  name: hello-kubernetes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-kubernetes
  template:
    metadata:
      labels:
        app: hello-kubernetes
    spec:
      containers:
      - name: hello-kubernetes
        image: paulbouwer/hello-kubernetes:1.5
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "200m"

Hey, thanks @jrtmendoza. Tracing through the code and testing it out, it seems that a label of some kind is required for the K8s metrics server to be able to filter for the app. I should make that point clear in the documentation somewhere - thanks!
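To see why the label matters: the metrics client lists pod metrics filtered by the scale target's label selector, so a pod whose template carries no matching labels yields an empty result and the "no metrics returned" error. This can be reproduced directly against the metrics API (the label value here is taken from the hello-kubernetes example above):

```shell
# Pod metrics filtered by label selector; an empty "items" list here is
# what surfaces as "no metrics returned from resource metrics API"
kubectl get --raw \
  "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods?labelSelector=app%3Dhello-kubernetes"
```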

As for why this specifically isn't working on EKS, I'm really not sure why this would be happening. I've set up my own EKS cluster to try to recreate this, but I've had no success. These are the exact steps I'm following; if there's anything I'm missing here please let me know (I've probably missed a key detail!):

  1. Create an EKS cluster with version 1.19 called phpa:
eksctl create cluster --name phpa --version 1.19
  2. Install bitnami metrics server 5.8.9 on it:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install metrics-server bitnami/metrics-server --version=5.8.9
  3. After installing, it will complain about the metrics API service not being enabled in the cluster:
########################################################################################
### ERROR: The metrics.k8s.io/v1beta1 API service is not enabled in the cluster      ###
########################################################################################
You have disabled the API service creation for this release. As the Kubernetes version in the cluster 
does not have metrics.k8s.io/v1beta1, the metrics API will not work with this release unless:

Option A: 

  You complete your metrics-server release by running:

  helm upgrade --namespace default metrics-server bitnami/metrics-server \
    --set apiService.create=true

Option B:
  
   You configure the metrics API service outside of this Helm chart

Run the following to resolve this:

helm upgrade --namespace default metrics-server bitnami/metrics-server --set apiService.create=true
  4. Install the Custom Pod Autoscaler Operator:
VERSION=v1.1.1
HELM_CHART=custom-pod-autoscaler-operator
helm install ${HELM_CHART} https://github.com/jthomperoo/custom-pod-autoscaler-operator/releases/download/${VERSION}/custom-pod-autoscaler-operator-${VERSION}.tgz
  5. Create a deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: php-apache
  name: php-apache
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: k8s.gcr.io/hpa-example
        imagePullPolicy: Always
        name: php-apache
        ports:
        - containerPort: 80
          protocol: TCP
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: php-apache
  sessionAffinity: None
  type: ClusterIP
  6. Deploy the deployment:
kubectl apply -f deployment.yaml
  7. Create a PHPA YAML file (phpa.yaml):
apiVersion: custompodautoscaler.com/v1
kind: CustomPodAutoscaler
metadata:
  name: simple-linear-example
spec:
  template:
    spec:
      containers:
      - name: simple-linear-example
        image: jthomperoo/predictive-horizontal-pod-autoscaler:latest
        imagePullPolicy: Always
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  roleRequiresMetricsServer: true
  config:
    - name: minReplicas
      value: "1"
    - name: maxReplicas
      value: "10"
    - name: predictiveConfig
      value: |
        models:
        - type: Linear
          name: LinearPrediction
          perInterval: 1
          linear:
            lookAhead: 10000
            storedValues: 6
        decisionType: "maximum"
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              averageUtilization: 50
              type: Utilization
    - name: interval
      value: "10000"
    - name: downscaleStabilization
      value: "0"
  8. Deploy the PHPA:
kubectl apply -f phpa.yaml
  9. Watch the scaler pod's logs once it has loaded:
kubectl logs simple-linear-example --follow
  10. It may take a minute or two for the metrics to become available, depending on how quickly the php-apache pod was provisioned; until then it may give the error:
invalid metrics (1 invalid out of 1), first error is: failed to get resource metric: unable to get metrics for resource cpu: no metrics returned from resource metrics API

Eventually this will stop and it will begin scaling:

I0831 21:55:44.799032       1 scaling.go:118] Picked max evaluation over stabilization window of 0 seconds; replicas 1
I0831 21:55:44.799317       1 scaling.go:170] No change in target replicas, maintaining 1 replicas