Optimization is not working - Azure AKS - v1.25.6
Opened this issue · 19 comments
Hi Team,
First of all, this looks like a promising new tool that can play an important role.
I just quickly tested it in Azure AKS v1.25.6. Below are my findings/comments:
- First, a small correction to the helm install command: a release name needs to be specified when installing.
helm install kube-reqsizer/kube-reqsizer --> helm install kube-reqsizer kube-reqsizer/kube-reqsizer
- I've deployed a basic application in the default namespace with high CPU/memory requests to test whether kube-reqsizer would optimize it. I waited for 22 minutes, but the requests were still the same.
- Logs for reference:
I0530 15:58:39.252063 1 request.go:601] Waited for 1.996392782s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/argocd
I0530 15:58:49.252749 1 request.go:601] Waited for 1.995931495s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/argocd
I0530 15:58:59.450551 1 request.go:601] Waited for 1.994652278s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/argocd
I0530 15:59:09.450621 1 request.go:601] Waited for 1.994074539s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kube-system
I0530 15:59:19.450824 1 request.go:601] Waited for 1.99598317s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kubescape
I0530 15:59:29.650328 1 request.go:601] Waited for 1.993913908s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/tigera-operator
I0530 15:59:39.650831 1 request.go:601] Waited for 1.996110718s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kubescape
I0530 15:59:49.850897 1 request.go:601] Waited for 1.995571438s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kube-system
I0530 16:00:00.049996 1 request.go:601] Waited for 1.994819712s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/calico-system
I0530 16:00:10.050864 1 request.go:601] Waited for 1.991681441s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/default
- How long will it take to optimize? Will it restart the pod automatically?
- I haven't customized any values; I just used the commands below to install:
helm repo add kube-reqsizer https://jatalocks.github.io/kube-reqsizer/
helm repo update
helm install kube-reqsizer kube-reqsizer/kube-reqsizer
Hey @zohebk8s, thanks for trying out the tool.
I've seen this happen to other people, and it seems like the kube API is too slow for the default configuration of the chart. To work around it, you need to set concurrentWorkers to 1.
This issue had the same problem as yours. Please see the correspondence here:
Thanks! Let me know how it goes.
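For reference, an upgrade along these lines should apply that setting (a sketch, assuming the chart exposes it as a Helm value named concurrentWorkers):
helm upgrade kube-reqsizer kube-reqsizer/kube-reqsizer --set concurrentWorkers=1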
@jatalocks Thanks for your response.
I've updated concurrentWorkers to "1", and the value of min-seconds in kube-reqsizer is also "1", as shown below. But it's still not updating the values. Am I missing something here?
I've added the below annotations to that deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
  annotations:
    reqsizer.jatalocks.github.io/optimize: "true"  # Include this Deployment when optimizing the cluster
    reqsizer.jatalocks.github.io/mode: "average"   # Default mode: optimizes based on the average of all sample points ("max" and "min" are the alternatives; only one mode annotation should be set)
Hey @zohebk8s, can you send a screenshot of the logs now? (A few minutes after the controller has started working.) It might take it a few minutes to resize.
Also, try adding the "optimize" annotation to the namespace this deployment is in.
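For example, something like this should do it (a sketch, using the annotation key from above on the default namespace):
kubectl annotate namespace default reqsizer.jatalocks.github.io/optimize=true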
I've added the annotation to the default namespace, where this deployment is running, but the values are still the same.
kube-reqsizer-controller-manager-795bbd7677-dl4xx-logs.txt
The request values still didn't change.
The utilization of the pods is quite low, and I was expecting kube-reqsizer to make a change/optimization. In the requests, I've specified the values below:
resources:
  requests:
    cpu: "100m"
    memory: 400Mi
I've attached the full log file for your reference; please see the attached txt file.
@zohebk8s it appears it's working. If you gave it time through the night, did it eventually work? It might take some time with concurrentWorkers=1, but eventually it has enough data in its cache to make the decision.
That's odd; it should have worked immediately. I think something is preventing it from resizing. What are your values/configuration? You should make sure minSeconds=1 and sampleSize=1 as well.
The configuration should match what's at the top of the README (except concurrentWorkers=1).
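As a reference, a minimal values sketch for that setup might look like this (assuming the chart exposes these keys under the names used in this thread):
concurrentWorkers: 1
minSeconds: 1
sampleSize: 1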
It's already "1" for concurrent-workers, minSeconds, and sampleSize.
It's Azure AKS v1.25.6, and the default namespace is Istio-injected. I hope it's not something specific to Istio.
configuration:
spec:
  containers:
  - args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=:8080
    - --leader-elect
    - --annotation-filter=true
    - --sample-size=1
    - --min-seconds=1
    - --zap-log-level=info
    - --enable-increase=true
    - --enable-reduce=true
    - --max-cpu=0
    - --max-memory=0
    - --min-cpu=0
    - --min-memory=0
    - --min-cpu-increase-percentage=0
    - --min-memory-increase-percentage=0
    - --min-cpu-decrease-percentage=0
    - --min-memory-decrease-percentage=0
    - --cpu-factor=1
    - --memory-factor=1
    - --concurrent-workers=1
    - --enable-persistence=true
    - --redis-host=kube-reqsizer-redis-master
What are the resource requirements for the deployments in the default namespace? The only thing I can think of is that it doesn't have anything to resize, so it just continues sampling the pods. Also, if there are no requests/limits to begin with, there's nothing to resize from. I'd check that the pods are configured with resources.
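For example, something like this should list the current requests on the pods (a plain kubectl sketch):
kubectl get pods -n default -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests}{"\n"}{end}'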
I've defined requests/limits for this deployment and the utilization is very low; that's why I thought of raising this question/issue.
If it doesn't have requests/limits, then as you said it won't work. But in my case, I've defined requests/limits and the CPU/memory utilization is very low as well.
I see that reqsizer has been alive for 11 minutes. I'd give it some more time for now, and I'll check if there's a problem specific to AKS.
@jatalocks Thank you for your patience and responses. I feel this product can make a difference if it works properly, since it targets resource optimization, which translates directly into cost optimization.
@jatalocks Is this a bug, or is some kind of enhancement required at the product level?
I hope the information I've shared is of help.
@zohebk8s I think that by now, if the controller has been running continuously, the app should have already been resized.
@ElementTech I see that @zohebk8s seems to be using Argo CD in this cluster. Could it be that Argo CD is directly undoing the changes made to the Deployment's resources?
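If that turns out to be the case, one common workaround (a sketch, assuming a hypothetical Argo CD Application named "app" that manages this Deployment) is to tell Argo CD to ignore diffs on the container resources field so self-heal doesn't revert the resize:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app  # hypothetical Application name; source/destination/project omitted for brevity
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    # ignore changes to the first container's resources block
    jsonPointers:
    - /spec/template/spec/containers/0/resources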