KEDA ScaledObject primary creation failed due to workload already managed by the HPA
jdgeisler opened this issue · 2 comments
Describe the bug
Following the Flagger docs for KEDA integration, I updated the Canary to point to the KEDA ScaledObject instead of the HPA. Additionally, I added the annotation to the KEDA ScaledObject so it takes over ownership of the old HPA. Flagger then goes to create the primary ScaledObject from the ScaledObject; however, because the old primary HPA is not cleaned up by Flagger when the original HPA is removed, it still exists and causes the error below due to the collision.
{"level":"info","ts":"2024-05-16T14:18:12.738Z","caller":"controller/events.go:45","msg":"creating Keda ScaledObject fortio-server-deployment-2-primary.fortio failed: admission webhook \"vscaledobject.kb.io\" denied the request: the workload 'fortio-server-deployment-2-primary' of type 'apps/v1.Deployment' is already managed by the hpa 'fortio-server-deployment-2-primary'","canary":"fortio-server-deployment-2.fortio"}
The behavior I would expect from Flagger when migrating from an HPA to a ScaledObject in the Canary is to check whether the primary HPA still exists. If it does, Flagger should add the same annotation that is on the KEDA ScaledObject to the new primary ScaledObject, so that instead of colliding with the already existing HPA, it takes over ownership of it.
I tried various workarounds as well, such as manually creating the primary ScaledObject and taking ownership of the already existing primary HPA, but on the next canary analysis run Flagger reconciles and updates the primary ScaledObject, removing the ownership annotation, so the duplicate is created anyway.
I dug into the code and found where the ScaledObject primary reconciliation happens, and I think it would be relatively straightforward to implement the above logic to add the KEDA transfer-ownership annotation to the primary ScaledObject (see the proposed change under Additional context below).
Without this implemented, it seems I am unable to migrate from an HPA to a ScaledObject in a Canary unless we follow a multi-step process to manually delete the old primary HPA so the new primary ScaledObject can be created.
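For reference, the manual workaround amounts to deleting the stale primary HPA by hand before Flagger reconciles, for example (note that this briefly leaves the primary deployment without autoscaling):
k delete hpa fortio-server-deployment-2-primary -n fortio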
To Reproduce
- Have an already existing Canary referencing an HPA
- Replace the HPA with the ScaledObject, then update the autoscalerRef in the Canary to point to the ScaledObject, as shown in the following diff:
fortio, fortio-server-deployment-2, Canary (flagger.app) has changed:
# Source: fortio/templates/fortio.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: fortio-server-deployment-2
  namespace: fortio
spec:
  analysis:
    interval: 30s
    maxWeight: 50
    metrics:
    - interval: 30s
      name: request-success-rate
      thresholdRange:
        min: 99
    stepWeight: 10
    threshold: 10
    webhooks:
    - metadata:
        async: "on"
        c: "6"
        labels: fortio-server-deployment-2
        load: Start
        nocatchup: "on"
        qps: "12"
        save: "on"
        stdclient: "on"
        t: 600s
        uniform: "on"
        url: http://fortio-server-deployment-2-canary.fortio.svc.cluster.local:8080/
      name: generateLoad
      timeout: 60s
      type: pre-rollout
      url: http://fortio-client.fortio.svc.cluster.local:8080/fortio/rest/run?jsonPath=.metadata
  progressDeadlineSeconds: 600
  autoscalerRef:
-   apiVersion: autoscaling/v2
-   kind: HorizontalPodAutoscaler
+   apiVersion: keda.sh/v1alpha1
+   kind: ScaledObject
    name: fortio-server-deployment-2
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fortio-server-deployment-2
  service:
    port: 8080
    targetPort: 8080
fortio, fortio-server-deployment-2, HorizontalPodAutoscaler (autoscaling) has been removed:
- # Source: fortio/templates/fortio.yaml
- apiVersion: autoscaling/v2
- kind: HorizontalPodAutoscaler
- metadata:
-   name: fortio-server-deployment-2
-   namespace: fortio
- spec:
-   maxReplicas: 4
-   metrics:
-   - resource:
-       name: cpu
-       target:
-         averageUtilization: 99
-       type: Utilization
-     type: Resource
-   minReplicas: 2
-   scaleTargetRef:
-     apiVersion: apps/v1
-     kind: Deployment
-     name: fortio-server-deployment-2
+
fortio, fortio-server-deployment-2, ScaledObject (keda.sh) has been added:
-
+ # Source: fortio/templates/fortio.yaml
+ apiVersion: keda.sh/v1alpha1
+ kind: ScaledObject
+ metadata:
+   name: fortio-server-deployment-2
+   namespace: fortio
+   annotations:
+     scaledobject.keda.sh/transfer-hpa-ownership: "true"
+ spec:
+   advanced:
+     horizontalPodAutoscalerConfig:
+       name: fortio-server-deployment-2
+   scaleTargetRef:
+     name: fortio-server-deployment-2
+   minReplicaCount: 2
+   maxReplicaCount: 4
+   triggers:
+   - type: prometheus
+     metadata:
+       serverAddress: http://kube-prometheus-stack-prometheus:9090/prometheus
+       threshold: '100'
+       query: sum(rate(istio_requests_total{destination_service_name="fortio-server-2"}[1m]))
+   - type: cpu
+     metricType: Utilization
+     metadata:
+       value: "99"
- See that the ScaledObject is created and takes ownership of the original HPA. However, Flagger then fails to create the new primary ScaledObject due to the collision with the already existing primary HPA. In the output below you can see the primary HPA is not updated to use the custom metric from the KEDA ScaledObject:
k get pods
NAME READY STATUS RESTARTS AGE
fortio-client-deployment-75c6f57dc4-jws2f 2/2 Running 0 14d
fortio-http-proxy-parallel-b556bc8dc-6ffc4 2/2 Running 0 14d
fortio-http-proxy-serial-7d54446cd-44wml 2/2 Running 0 14d
fortio-server-deployment-1-7d6fd57d76-26bzd 2/2 Running 0 14d
fortio-server-deployment-1-7d6fd57d76-rrscw 2/2 Running 0 14d
fortio-server-deployment-2-88c7fb48b-dg2l4 2/2 Running 0 8m50s
fortio-server-deployment-2-88c7fb48b-zm5tn 2/2 Running 0 8m50s
fortio-server-deployment-2-primary-6d9fb88898-8rrrn 2/2 Running 0 24m
fortio-server-deployment-2-primary-6d9fb88898-z4bdb 2/2 Running 0 24m
fortio-tcp-proxy-6f566d4cfc-qzsgq 2/2 Running 0 14d
k get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
fortio-server-deployment-2 Deployment/fortio-server-deployment-2 0/100 (avg), 4%/99% 2 4 2 8m55s
fortio-server-deployment-2-primary Deployment/fortio-server-deployment-2-primary 3%/99% 2 4 2 24m
k get scaledobject
NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK PAUSED AGE
fortio-server-deployment-2 apps/v1.Deployment fortio-server-deployment-2 2 4 prometheus True False False Unknown 9m1s
- See the below error from Flagger failing to create the primary ScaledObject:
{"level":"info","ts":"2024-05-16T14:18:12.738Z","caller":"controller/events.go:45","msg":"creating Keda ScaledObject fortio-server-deployment-2-primary.fortio failed: admission webhook \"vscaledobject.kb.io\" denied the request: the workload 'fortio-server-deployment-2-primary' of type 'apps/v1.Deployment' is already managed by the hpa 'fortio-server-deployment-2-primary'","canary":"fortio-server-deployment-2.fortio"}
- Ideally, when the primary ScaledObject is created by Flagger, it should carry the below annotation so that it takes ownership of the primary HPA if one already exists. Otherwise it is impossible to migrate from an HPA to a ScaledObject without manually deleting the old primary HPA first:
annotations:
  scaledobject.keda.sh/transfer-hpa-ownership: "true"
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      name: fortio-server-deployment-2-primary
Expected behavior
This is described above in the description and steps to reproduce.
The behavior I would expect from Flagger when migrating from an HPA to a ScaledObject in the Canary is to check whether the primary HPA still exists. If it does, Flagger should add the annotation from the KEDA ScaledObject to the new primary ScaledObject, so that instead of colliding with the already existing HPA, it takes over ownership of it.
Additional context
- Flagger version: 1.34.0
- Kubernetes version: 1.27
- Service Mesh provider: Istio 1.19.7
I think a possible solution could roughly look like the following:
When the primary ScaledObject is created by Flagger, the annotations from the target ScaledObject that is cloned should also be carried over. This way, if the scaledobject.keda.sh/transfer-hpa-ownership annotation is set on the target ScaledObject, it will also be set on the primary ScaledObject. This would prevent the collision between the old primary HPA and the new one being created.
primarySo = &keda.ScaledObject{
	// Pass in the annotations from the targetSo so that they are carried
	// over to the primarySo. This is required so that the transfer-ownership
	// annotation can be added.
	ObjectMeta: makeObjectMetaSo(primarySoName, targetSoClone.Labels, targetSoClone.Annotations, cd),
	Spec:       soSpec,
}
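For illustration, a minimal sketch of what such a helper could look like. The name makeObjectMetaSo and its exact signature are part of this proposal, not existing Flagger API, and I have omitted the owner-reference handling that the real helper would presumably set from the Canary (cd):

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	flaggerv1 "github.com/fluxcd/flagger/pkg/apis/flagger/v1beta1"
)

// makeObjectMetaSo builds the ObjectMeta for the primary ScaledObject,
// copying both labels and annotations from the target so that the
// scaledobject.keda.sh/transfer-hpa-ownership annotation survives the clone.
// (Hypothetical sketch; owner references omitted for brevity.)
func makeObjectMetaSo(name string, labels, annotations map[string]string, cd *flaggerv1.Canary) metav1.ObjectMeta {
	// Copy the maps so later mutations of the primary never leak back
	// into the target ScaledObject.
	l := make(map[string]string, len(labels))
	for k, v := range labels {
		l[k] = v
	}
	a := make(map[string]string, len(annotations))
	for k, v := range annotations {
		a[k] = v
	}
	return metav1.ObjectMeta{
		Name:        name,
		Namespace:   cd.Namespace,
		Labels:      l,
		Annotations: a,
	}
}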
Additionally, the Advanced.HorizontalPodAutoscalerConfig.Name in the keda.ScaledObjectSpec should also be configurable and set to the primary name in this case. Otherwise, the generated HPA will get the default keda-hpa prefix. This is detailed in the KEDA documentation.
// Set the name of the horizontal pod autoscaler to whatever the primary name
// should be, otherwise the HPA name won't match the ScaledObject name.
Advanced: &keda.AdvancedConfig{
	RestoreToOriginalReplicaCount: targetSoClone.Spec.Advanced.RestoreToOriginalReplicaCount,
	HorizontalPodAutoscalerConfig: &keda.HorizontalPodAutoscalerConfig{
		Name:     primaryName,
		Behavior: targetSoClone.Spec.Advanced.HorizontalPodAutoscalerConfig.Behavior,
	},
},
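One caveat with the snippet above, based on my reading of the KEDA v1alpha1 Go types (where spec.advanced and horizontalPodAutoscalerConfig are optional pointers): the target ScaledObject may omit them entirely, so the clone would need nil guards. A rough sketch, assuming the upstream KEDA types (Flagger may vendor its own copy), with primaryAdvancedConfig being a hypothetical helper name:

import (
	keda "github.com/kedacore/keda/v2/apis/keda/v1alpha1"
)

// primaryAdvancedConfig always forces the generated HPA name to the primary
// name, and only copies the optional fields the target actually sets.
func primaryAdvancedConfig(targetSoClone *keda.ScaledObject, primaryName string) *keda.AdvancedConfig {
	advanced := &keda.AdvancedConfig{
		HorizontalPodAutoscalerConfig: &keda.HorizontalPodAutoscalerConfig{
			Name: primaryName,
		},
	}
	if targetSoClone.Spec.Advanced != nil {
		advanced.RestoreToOriginalReplicaCount = targetSoClone.Spec.Advanced.RestoreToOriginalReplicaCount
		if cfg := targetSoClone.Spec.Advanced.HorizontalPodAutoscalerConfig; cfg != nil {
			advanced.HorizontalPodAutoscalerConfig.Behavior = cfg.Behavior
		}
	}
	return advanced
}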
With these changes, when the following ScaledObject is created and the Canary is updated to reference a ScaledObject instead of an HPA, instead of failing to create the new primary HPA because it already exists, Flagger would take ownership of it.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fortio-server-deployment-2
  namespace: fortio
  annotations:
    scaledobject.keda.sh/transfer-hpa-ownership: "true"
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      name: fortio-server-deployment-2
  scaleTargetRef:
    name: fortio-server-deployment-2
  minReplicaCount: 2
  maxReplicaCount: 4
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://kube-prometheus-stack-prometheus:9090/prometheus
      threshold: '100'
      query: sum(rate(istio_requests_total{destination_service_name="fortio-server-2"}[1m]))
  - type: cpu
    metricType: Utilization
    metadata:
      value: "99"
The above annotation from the target ScaledObject would be carried over to the primary ScaledObject, and it would reference the correct fortio-server-deployment-2-primary HPA.
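For illustration, with the proposed change applied, the primary ScaledObject generated by Flagger would look roughly like this (my sketch of the expected result, not actual controller output):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fortio-server-deployment-2-primary
  namespace: fortio
  annotations:
    scaledobject.keda.sh/transfer-hpa-ownership: "true"
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      # matches the primary HPA left behind by the old autoscalerRef
      name: fortio-server-deployment-2-primary
  scaleTargetRef:
    name: fortio-server-deployment-2-primary
  minReplicaCount: 2
  maxReplicaCount: 4
  # triggers omitted for brevity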
Also seeing this same behavior when switching the autoscalerRef from HPA to Scaled Objects.