practo/k8s-worker-pod-autoscaler

V1.0.0: scale down never happens because msgsReceived is not updated

AmeerAssi opened this issue · 4 comments

I am testing autoscaler version 1.0.0, which I see was released recently, and I am hitting the following behavior:
after a scale-up, once the work has finished, scale-down never happens.
Looking at my queue in the AWS console, I can see it has been empty, with no messages in flight, for more than 20 minutes:
[screenshot: AWS SQS console showing the queue empty, with no messages in flight]
Here is the monitoring view for the queue, where you can see that all message handling had finished before 22:00:
[screenshot: SQS queue monitoring graphs showing message handling finished before 22:00]

When looking at the autoscaler logs, I see that the replicas are not scaled down because the controller still reports received messages.
Here are two log snapshots taken more than 20 minutes apart (according to the documentation in the code, the cache should last only 1 minute):
[two log screenshots taken more than 20 minutes apart, both still reporting received messages]

It looks like the msgsReceived cache is never refreshed.
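
To illustrate what a correctly refreshed cache would look like, here is a minimal, hypothetical Go sketch of a TTL-keyed metric cache. This is not the project's actual implementation; the type names, the 50ms demo TTL, and the queue URI are assumptions for illustration only. The behavior above is consistent with such a cache either never expiring its entries or never being overwritten with fresh readings, so the controller keeps seeing a stale non-zero msgsReceived value and refuses to scale down.

// Hypothetical sketch (not the project's actual code): a TTL cache for the
// "messages received" metric, keyed by queue URI.
package main

import (
	"fmt"
	"sync"
	"time"
)

type cachedValue struct {
	value    float64
	cachedAt time.Time
}

type receivedCache struct {
	mu      sync.RWMutex
	ttl     time.Duration
	entries map[string]cachedValue
}

func newReceivedCache(ttl time.Duration) *receivedCache {
	return &receivedCache{ttl: ttl, entries: map[string]cachedValue{}}
}

// Set stores the latest messages-received reading for a queue.
func (c *receivedCache) Set(queueURI string, v float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[queueURI] = cachedValue{value: v, cachedAt: time.Now()}
}

// Get returns the cached reading only while it is younger than the TTL;
// ok=false tells the caller to fetch a fresh reading instead of reusing
// the stale one.
func (c *receivedCache) Get(queueURI string) (value float64, ok bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, found := c.entries[queueURI]
	if !found || time.Since(e.cachedAt) > c.ttl {
		return 0, false
	}
	return e.value, true
}

func main() {
	// 50ms TTL just to demonstrate expiry quickly; the documentation quoted
	// above suggests the real cache is meant to expire after about a minute.
	queueURI := "https://sqs.us-east-2.amazonaws.com/123456789012/processor-ip4m" // hypothetical
	cache := newReceivedCache(50 * time.Millisecond)
	cache.Set(queueURI, 720)

	time.Sleep(60 * time.Millisecond)
	if _, ok := cache.Get(queueURI); !ok {
		fmt.Println("cached msgsReceived is stale, refresh it before deciding on scale-down")
	}
}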

Pod describe info:
Name:                 workerpodautoscaler-57fc6bf9d9-225db
Namespace:            kube-system
Priority:             1000
Priority Class Name:  infra-normal-priority
Node:                 ip-192-168-127-142.us-east-2.compute.internal/192.168.127.142
Start Time:           Sun, 12 Jul 2020 00:41:20 +0300
Labels:               app=workerpodautoscaler
                      pod-template-hash=57fc6bf9d9
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   192.168.126.38
Controlled By:        ReplicaSet/workerpodautoscaler-57fc6bf9d9
Containers:
  wpa:
    Container ID:   docker://4898ad92c38baed27d84a0f206ee60b85f0b149526142a2abfd956dccc676069
    Image:          practodev/workerpodautoscaler:v1.0.0
    Image ID:       docker-pullable://practodev/workerpodautoscaler@sha256:2bdcaa251e2a2654e73121721589ac5bb8536fbeebc2b7a356d24199ced84e73
    Port:
    Host Port:
    Command:
      /workerpodautoscaler
      run
      --resync-period=60
      --wpa-threads=10
      --aws-regions=us-east-2
      --sqs-short-poll-interval=20
      --sqs-long-poll-interval=20
      --wpa-default-max-disruption=0
    State:          Running
      Started:      Sun, 12 Jul 2020 00:41:22 +0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment Variables from:
      workerpodautoscaler-secret-env  Secret  Optional: false
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from workerpodautoscaler-token-j8lvc (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            True
  ContainersReady  True
  PodScheduled     True
Volumes:
  workerpodautoscaler-token-j8lvc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  workerpodautoscaler-token-j8lvc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     :NoExecute
                 :NoSchedule
Events:
  Type    Reason     Age  From                                                    Message
  Normal  Scheduled  45m  default-scheduler                                       Successfully assigned kube-system/workerpodautoscaler-57fc6bf9d9-225db to ip-192-168-127-142.us-east-2.compute.internal
  Normal  Pulling    45m  kubelet, ip-192-168-127-142.us-east-2.compute.internal  Pulling image "practodev/workerpodautoscaler:v1.0.0"
  Normal  Pulled     45m  kubelet, ip-192-168-127-142.us-east-2.compute.internal  Successfully pulled image "practodev/workerpodautoscaler:v1.0.0"
  Normal  Created    45m  kubelet, ip-192-168-127-142.us-east-2.compute.internal  Created container wpa
  Normal  Started    45m  kubelet, ip-192-168-127-142.us-east-2.compute.internal  Started container wpa
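
For readers unfamiliar with the flags in the Command section above: --sqs-short-poll-interval and --sqs-long-poll-interval govern how the controller polls SQS for queue state (20 seconds here). As a rough illustration only, here is a minimal, self-contained Go sketch of that kind of periodic poll using the public aws-sdk-go GetQueueAttributes call; the loop structure, the queue URL, and the choice of attributes are assumptions for illustration, not the project's actual code.

// Hypothetical polling sketch: fetch the visible and in-flight message
// counts for one queue every 20 seconds, matching the poll intervals
// configured via the flags above.
package main

import (
	"fmt"
	"log"
	"strconv"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	// Region matches --aws-regions above; the queue URL is a placeholder.
	queueURL := "https://sqs.us-east-2.amazonaws.com/123456789012/processor-ip4m"
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-2")}))
	svc := sqs.New(sess)

	ticker := time.NewTicker(20 * time.Second) // --sqs-*-poll-interval=20
	defer ticker.Stop()

	for range ticker.C {
		out, err := svc.GetQueueAttributes(&sqs.GetQueueAttributesInput{
			QueueUrl: aws.String(queueURL),
			AttributeNames: []*string{
				aws.String("ApproximateNumberOfMessages"),
				aws.String("ApproximateNumberOfMessagesNotVisible"),
			},
		})
		if err != nil {
			log.Printf("GetQueueAttributes failed: %v", err)
			continue
		}
		visible, _ := strconv.Atoi(aws.StringValue(out.Attributes["ApproximateNumberOfMessages"]))
		inFlight, _ := strconv.Atoi(aws.StringValue(out.Attributes["ApproximateNumberOfMessagesNotVisible"]))
		fmt.Printf("visible=%d inFlight=%d\n", visible, inFlight)
	}
}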

WPA resource:
apiVersion: k8s.practo.dev/v1alpha1
kind: WorkerPodAutoScaler
metadata:
  creationTimestamp: "2020-01-28T14:59:16Z"
  generation: 5316
  name: processor-ip4m
  namespace: default
  resourceVersion: "52253623"
  selfLink: /apis/k8s.practo.dev/v1alpha1/namespaces/default/workerpodautoscalers/processor-ip4m
  uid: c111ba43-41de-11ea-b4d5-066ce59a32e8
spec:
  deploymentName: processor-ip4m
  maxDisruption: null
  maxReplicas: 80
  minReplicas: 1
  queueURI: **************
  secondsToProcessOneJob: 10
  targetMessagesPerWorker: 720
status:
  CurrentMessages: 0
  CurrentReplicas: 31
  DesiredReplicas: 31
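
To make explicit why scale-down is expected here, below is a simplified, hypothetical Go sketch of a target-based replica calculation using the spec and status values above. It is not the project's actual algorithm, and it ignores secondsToProcessOneJob and maxDisruption. With CurrentMessages: 0, such a calculation returns minReplicas = 1, so DesiredReplicas staying at 31 points at a stale messages-received input rather than at anything in the WPA spec.

// Simplified sketch (NOT the project's exact algorithm): clamp a naive
// messages/targetMessagesPerWorker ratio to [minReplicas, maxReplicas].
package main

import (
	"fmt"
	"math"
)

func desiredReplicas(queueMessages, targetPerWorker, minReplicas, maxReplicas int32) int32 {
	if queueMessages <= 0 {
		// Empty queue with nothing in flight: fall back to the minimum.
		return minReplicas
	}
	d := int32(math.Ceil(float64(queueMessages) / float64(targetPerWorker)))
	if d < minReplicas {
		return minReplicas
	}
	if d > maxReplicas {
		return maxReplicas
	}
	return d
}

func main() {
	// Spec values from above: targetMessagesPerWorker=720, minReplicas=1, maxReplicas=80.
	fmt.Println(desiredReplicas(0, 720, 1, 80))     // CurrentMessages: 0 -> expect scale-down to 1
	fmt.Println(desiredReplicas(22000, 720, 1, 80)) // illustrative busy queue: ceil(22000/720) = 31
}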

Thanks for reporting this. Working on the fix #99

@AmeerAssi

We have not created a GitHub release for it yet, but we have pushed the following Docker images for use:

pushed: practodev/workerpodautoscaler:v1.0.0-21-gfdb7dcd
pushed: practodev/workerpodautoscaler:v1.0
pushed: practodev/workerpodautoscaler:v1

Please try it out and let us know if the issue gets fixed for you!
Thanks again for reporting this. 👍

I ran into this same issue. Updating from v1.0.0 to v1.0 worked for me.

Yes, this was a major issue. We are planning to ship the fix soon as the latest GitHub release, v1.1.0.