Unable to get metrics for resource CPU metrics on EKS
Closed this issue · 12 comments
Hi,
Unfortunately I'm really struggling to work out why this won't pick up metrics for my service. Even with logVerbosity: 3 I can't get any useful logs out. Any idea what I'm doing wrong?
I'm on Amazon EKS with K8S Version v1.16.8-eks-e16311
and Metrics Server v0.3.7.
I've verified it isn't permissions. Binding cluster-admin to the scaler pod doesn't seem to help and I get a different error when permissions are missing.
Logs:
I0825 10:58:20.014351 15 metric.go:76] Gathering metrics in per-resource mode
I0825 10:58:20.016279 15 metric.go:94] Attempting to run metric gathering logic
I0825 10:58:20.057419 15 shell.go:80] Shell command failed, stderr: 2020/08/25 10:58:20 invalid metrics (1 invalid out of 1), first error is: failed to get resource metric: unable to get metrics for resource cpu: no metrics returned from resource metrics API
E0825 10:58:20.057450 15 main.go:248] exit status 1
Metrics server is working because kubectl top works:
kubectl top pod | grep content-repo-cache
content-repo-cache-566c695fc8-d6zjr 4m 912Mi
content-repo-cache-566c695fc8-dg5jj 40m 954Mi
content-repo-cache-scaler 2m 7Mi
Here's my YAML:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: content-repo-cache-scaler
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - replicationcontrollers
  - replicationcontrollers/scale
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - deployments
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  - statefulsets/scale
  verbs:
  - '*'
- apiGroups:
  - metrics.k8s.io
  resources:
  - '*'
  verbs:
  - '*'
---
apiVersion: custompodautoscaler.com/v1
kind: CustomPodAutoscaler
metadata:
  name: content-repo-cache-scaler
spec:
  template:
    spec:
      containers:
      - name: content-repo-cache-scaler
        image: jthomperoo/predictive-horizontal-pod-autoscaler:v0.5.0
        imagePullPolicy: IfNotPresent
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: content-repo-cache
  provisionRole: true
  config:
  - name: minReplicas
    value: "3"
  - name: maxReplicas
    value: "32"
  - name: logVerbosity
    value: "3"
  - name: predictiveConfig
    value: |
      models:
      - type: HoltWinters
        name: HoltWintersPrediction
        perInterval: 1
        holtWinters:
          alpha: 0.9
          beta: 0.9
          gamma: 0.9
          seasonLength: 4320
          storedSeasons: 4
          method: "additive"
      decisionType: "maximum"
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
  - name: interval
    value: "20000"
  - name: startTime
    value: "60000"
  - name: downscaleStabilization
    value: "600"
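(For reference, the HoltWinters model configured above, with alpha/beta/gamma, seasonLength, and the additive method, corresponds to standard triple exponential smoothing. A minimal Python sketch of the textbook recurrences follows; this is an illustration of the technique, not the PHPA's actual implementation.)

```python
def holt_winters_additive(series, alpha, beta, gamma, season_length):
    """One-step-ahead forecast via additive triple exponential smoothing.

    Textbook recurrences (not the PHPA's actual code):
      level_t    = alpha * (y_t - seasonal_{t-m}) + (1 - alpha) * (level_{t-1} + trend_{t-1})
      trend_t    = beta  * (level_t - level_{t-1}) + (1 - beta) * trend_{t-1}
      seasonal_t = gamma * (y_t - level_t) + (1 - gamma) * seasonal_{t-m}
    """
    m = season_length
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    # Initialise from the first two seasons.
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonal = [series[i] - level for i in range(m)]
    for t in range(m, len(series)):
        last_level = level
        level = alpha * (series[t] - seasonal[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (series[t] - level) + (1 - gamma) * seasonal[t % m]
    # Forecast the next point: level + trend + matching seasonal component.
    return level + trend + seasonal[len(series) % m]
```

With seasonLength: 4320 and a 20000ms interval, one season is 4320 × 20s = 24 hours, which is presumably the intent here.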
Hi, thanks very much for pointing this out.
Instead of provisionRole: true in the CustomPodAutoscaler definition it should be provisionRole: false.
So your YAML should be:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: content-repo-cache-scaler
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - replicationcontrollers
  - replicationcontrollers/scale
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - deployments
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  - statefulsets/scale
  verbs:
  - '*'
- apiGroups:
  - metrics.k8s.io
  resources:
  - '*'
  verbs:
  - '*'
---
apiVersion: custompodautoscaler.com/v1
kind: CustomPodAutoscaler
metadata:
  name: content-repo-cache-scaler
spec:
  template:
    spec:
      containers:
      - name: content-repo-cache-scaler
        image: jthomperoo/predictive-horizontal-pod-autoscaler:v0.5.0
        imagePullPolicy: IfNotPresent
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: content-repo-cache
  provisionRole: false
  config:
  - name: minReplicas
    value: "3"
  - name: maxReplicas
    value: "32"
  - name: logVerbosity
    value: "3"
  - name: predictiveConfig
    value: |
      models:
      - type: HoltWinters
        name: HoltWintersPrediction
        perInterval: 1
        holtWinters:
          alpha: 0.9
          beta: 0.9
          gamma: 0.9
          seasonLength: 4320
          storedSeasons: 4
          method: "additive"
      decisionType: "maximum"
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
  - name: interval
    value: "20000"
  - name: startTime
    value: "60000"
  - name: downscaleStabilization
    value: "600"
This seems to work on my cluster with that updated, running on EKS v1.16.13-eks-2ba888 with Metrics Server v0.3.7.
This guide outlines how the custom resources feature works in case you're interested.
Could you give that a go and see if that fixes it?
Thanks again, really sorry if this took a lot of your time, I'm sure this was a real headache!
Not at all. Thanks for being so helpful. Unfortunately the provisionRole true was an error in my original post; it was just something I was trying to see if it made a difference.
I've just deployed this example (https://github.com/jthomperoo/predictive-horizontal-pod-autoscaler/tree/master/example/simple-linear) in a fresh namespace and have the same problem. I wonder if it is somehow an incompatibility between the latest operator and this older codebase. I'm going to go and try out the Python example from the custom pod autoscaler repo and 🤞
Ok, the simple linear example now works in my cluster after scaling metrics-server down from 2 replicas to 1. It doesn't seem to like running two replicas. Still got problems with my actual example, but one step closer!
Ok so the example php-apache Deployment in my actual namespace works fine, but I get the invalid metrics error for any of my actual services. I can only assume something in the logic doesn't like something about my Pods, but I can't work out what that would be and I'm not sure how to add logging to debug with the STDIN/STDOUT setup of the application. Any ideas about how to further debug it?
I've attached the Pod YAML for one of my services. Maybe something will jump out at you. 😕
Thanks again for your help with this
Hmm OK,
The error:
no metrics returned from resource metrics API
comes from the module k8s.io/kubernetes/pkg/controller/podautoscaler/metrics and its MetricsClient struct, which is from the K8s codebase.
This is the same code that the normal Horizontal Pod Autoscaler uses, so I'm not really sure why this would be happening. Could you try deploying a HorizontalPodAutoscaler and check that it is accessing the metrics successfully (you can check its status and events with kubectl describe hpa <name>)?
Maybe you could also check the metrics server logs, just to make sure there aren't any issues with certs or anything (with kubectl logs <metrics server pod> -n <metrics server namespace>).
Here's an example HPA you could deploy:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: content-repo-cache-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: content-repo-cache
  minReplicas: 3
  maxReplicas: 32
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
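(For context, a Resource/Utilization target like this is evaluated with the standard scaling rule from the Kubernetes HPA docs: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A quick illustration:)

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Core HPA scaling rule from the Kubernetes docs:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# With the target of 80% above: 3 replicas at 160% average utilization
# should scale to 6, and 3 replicas at 40% should scale down to 2.
```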
You raise some really good points; this project is really hard to debug. I'll add extra logging and verbosity options for the underlying CPA Horizontal Pod Autoscaler and PHPA, which will hopefully help out a bit.
I'm going to continue playing around with this to see if I can recreate it. It seems strange that multiple metrics server replicas cause it to not work; maybe a sign of an underlying issue.
Oh! Just looking at your Pod YAML, are you using a ReplicaSet? If so, in your PHPA definition you are targeting:
scaleTargetRef:
  apiVersion: apps/v1
  kind: Deployment
  name: content-repo-cache
This should be:
scaleTargetRef:
  apiVersion: apps/v1
  kind: ReplicaSet
  name: content-repo-cache
Just a thought, since your Pod YAML has:
ownerReferences:
- apiVersion: apps/v1
  blockOwnerDeletion: true
  controller: true
  kind: ReplicaSet
  name: content-repo-cache-5cdd7d9cdb
  uid: 8696206b-77ea-4969-8d78-a9d29368afa0
There hasn't been any activity on this for a while, so I'm gonna close this issue. If it still isn't resolved feel free to reopen it!
Hi, first let me say that I think this is a great project and I'm interested to see where it goes! Unfortunately, I've been seeing the same problem described here with both your simple-linear-example and simple-holt-winters-example. Metrics server is working fine: our regular HPAs are able to collect metrics, and I can query for metrics with kubectl as well and see CPU/memory metrics.
But I'm also seeing the same error message as @cablespaghetti:
invalid metrics (1 invalid out of 1), first error is: failed to get resource metric: unable to get metrics for resource cpu: no metrics returned from resource metrics API
I tested this both with your Getting Started Guide and with some of my own deployments, and saw the same behavior.
I tracked it down to one of your internal modules: https://github.com/jthomperoo/horizontal-pod-autoscaler/blob/master/metric/gatherer.go#L137, but that's about as far as I got.
EKS K8s version: 1.19
Metrics Server: https://artifacthub.io/packages/helm/bitnami/metrics-server/5.8.9
I wonder if the issue may somehow be related to EKS, as it looks like both cablespaghetti and myself are using it. Here are the metrics for the php-apache pod from your Getting Started example:
❯ kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/php-apache-d4cf67d68-959mh | jq
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "php-apache-d4cf67d68-959mh",
    "namespace": "default",
    "creationTimestamp": "2021-08-06T15:18:55Z",
    "labels": {
      "pod-template-hash": "d4cf67d68",
      "run": "php-apache"
    }
  },
  "timestamp": "2021-08-06T15:18:01Z",
  "window": "20s",
  "containers": [
    {
      "name": "php-apache",
      "usage": {
        "cpu": "2742606n",
        "memory": "10748Ki"
      }
    }
  ]
}
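(For anyone checking these numbers by hand: the n suffix on the CPU quantity is nanocores and Ki is kibibytes, so this pod is using roughly 2.7 millicores. A quick helper of my own, handling only the quantity forms shown here, not a general Kubernetes quantity parser:)

```python
def cpu_millicores(q):
    """Convert a CPU quantity string to millicores.
    Minimal: only handles 'n' (nanocores), 'm' (millicores), and plain cores."""
    if q.endswith("n"):
        return int(q[:-1]) / 1_000_000  # nanocores -> millicores
    if q.endswith("m"):
        return int(q[:-1])
    return float(q) * 1000

def utilization_percent(usage_millicores, request_millicores):
    # averageUtilization compares usage against the pod's CPU *request*.
    return 100.0 * usage_millicores / request_millicores

# e.g. against the 200m request in the Getting Started deployment, the
# pod above is sitting at well under 2% utilization.
```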
Anyway, cool project and hope that's helpful. Thanks and good luck!
Hi, first of all let me say this is a terrific project. I am running into the same issues as @garrett-EY and @cablespaghetti, where deploying CPA with custom deployment workloads and the k8s metrics-server returns this error message:
"main.go:277] invalid metrics (1 invalid out of 1), first error is: failed to get resource metric: unable to get metrics for resource cpu: no metrics returned from resource metrics API"
The example deployment file for k8s-metrics-cpu works as expected. I did a little experiment and worked through the different deployment files included in the other examples, just to check that everything else was working. I had to add a "resources" spec in the deployment file for http-request since CPA was complaining about "missing cpu requests". I did that and tried launching the app, but CPA still complained about it even though the pod had resource specs set.
I compared the k8s-metrics-cpu and modified http-request deployment files, and noticed that the latter was missing "labels" data under the metadata field on line 3, so I added one with an "app" field under it. Lo and behold, after doing that the errors went away and CPA started churning out informational messages about no changes in target replicas.
Could it be that differences in fields in the deployment file are causing the error?
Here are the two versions of the deployment file I tested. The first is the modified http-request deployment file, to which I added a resources:requests field but which still did not work. The second is the one that works with CPA.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-kubernetes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-kubernetes
  template:
    metadata:
      labels:
        app: hello-kubernetes
    spec:
      containers:
      - name: hello-kubernetes
        image: paulbouwer/hello-kubernetes:1.5
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "200m"

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: hello-kubernetes
  name: hello-kubernetes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-kubernetes
  template:
    metadata:
      labels:
        app: hello-kubernetes
    spec:
      containers:
      - name: hello-kubernetes
        image: paulbouwer/hello-kubernetes:1.5
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "200m"
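(To make the difference explicit: the only change between the two manifests is the top-level metadata.labels block. A trivial check, illustrative only and using plain dicts rather than parsed YAML:)

```python
def missing_top_level_labels(manifest):
    """Return True if a Deployment manifest has no top-level
    metadata.labels -- the only difference between the failing and
    working manifests above."""
    return not manifest.get("metadata", {}).get("labels")

# The failing manifest has only metadata.name; the working one also
# carries labels: {app: hello-kubernetes} at the top level.
```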
Hey, thanks @jrtmendoza! Tracing through the code and testing it out, it seems that a label of some kind is required for the K8s metrics server to be able to filter the app's pods. I should make that point clear in the documentation somewhere; thanks!
As for why this specifically isn't working on EKS, I'm really not sure. I've set up my own EKS cluster to try to recreate this but have had no success. These are the exact steps I'm following; if there's anything I'm missing here please let me know (I've probably missed a key detail!):
- Create an EKS cluster with version 1.19 called phpa:
eksctl create cluster --name phpa --version 1.19
- Install bitnami metrics server 5.8.9 on it:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install metrics-server bitnami/metrics-server --version=5.8.9
- After installing it will complain about a metrics service not being enabled in the cluster:
########################################################################################
### ERROR: The metrics.k8s.io/v1beta1 API service is not enabled in the cluster ###
########################################################################################
You have disabled the API service creation for this release. As the Kubernetes version in the cluster
does not have metrics.k8s.io/v1beta1, the metrics API will not work with this release unless:
Option A:
You complete your metrics-server release by running:
helm upgrade --namespace default metrics-server bitnami/metrics-server \
--set apiService.create=true
Option B:
You configure the metrics API service outside of this Helm chart
Run the following to resolve this:
helm upgrade --namespace default metrics-server bitnami/metrics-server --set apiService.create=true
- Install the Custom Pod Autoscaler Operator:
VERSION=v1.1.1
HELM_CHART=custom-pod-autoscaler-operator
helm install ${HELM_CHART} https://github.com/jthomperoo/custom-pod-autoscaler-operator/releases/download/${VERSION}/custom-pod-autoscaler-operator-${VERSION}.tgz
- Create a deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: php-apache
  name: php-apache
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: k8s.gcr.io/hpa-example
        imagePullPolicy: Always
        name: php-apache
        ports:
        - containerPort: 80
          protocol: TCP
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: php-apache
  sessionAffinity: None
  type: ClusterIP
- Deploy the deployment:
kubectl apply -f deployment.yaml
- Create a PHPA YAML file:
apiVersion: custompodautoscaler.com/v1
kind: CustomPodAutoscaler
metadata:
  name: simple-linear-example
spec:
  template:
    spec:
      containers:
      - name: simple-linear-example
        image: jthomperoo/predictive-horizontal-pod-autoscaler:latest
        imagePullPolicy: Always
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  roleRequiresMetricsServer: true
  config:
  - name: minReplicas
    value: "1"
  - name: maxReplicas
    value: "10"
  - name: predictiveConfig
    value: |
      models:
      - type: Linear
        name: LinearPrediction
        perInterval: 1
        linear:
          lookAhead: 10000
          storedValues: 6
      decisionType: "maximum"
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            averageUtilization: 50
            type: Utilization
  - name: interval
    value: "10000"
  - name: downscaleStabilization
    value: "0"
- Deploy the PHPA:
kubectl apply -f phpa.yaml
- Watch the scaler pod when it has loaded:
kubectl logs simple-linear-example --follow
- It may take a minute or two for the metrics to be available, depending on how quickly the php-apache pod was provisioned; in the meantime it may give the error:
invalid metrics (1 invalid out of 1), first error is: failed to get resource metric: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Eventually this will stop and it will begin scaling:
I0831 21:55:44.799032 1 scaling.go:118] Picked max evaluation over stabilization window of 0 seconds; replicas 1
I0831 21:55:44.799317 1 scaling.go:170] No change in target replicas, maintaining 1 replicas
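(The "Picked max evaluation over stabilization window" line reflects downscale stabilization: within the window, the highest recommendation wins, so the target only scales down once the whole window agrees. A sketch of that idea, my own simplification rather than the PHPA's actual code:)

```python
import time

def stabilized_target(history, window_seconds, now=None):
    """history: list of (timestamp, recommended_replicas) pairs.
    Returns the max recommendation inside the stabilization window,
    so a scale-down only takes effect once every recommendation in
    the window is lower. Sketch only, not the PHPA's actual code."""
    now = time.time() if now is None else now
    in_window = [r for ts, r in history if now - ts <= window_seconds]
    return max(in_window)
```

With downscaleStabilization: 0 (as in the example above) the window only ever contains the latest recommendation, so every evaluation applies immediately.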