l7mp/stunner

Stunner gateway operator `ERROR updater cannot update service` on AWS + EKS + ALB due to `"cannot upsert service \"stunner/udp-gateway\": Service \"udp-gateway\" is invalid: spec.loadBalancerClass: Invalid value: \"null\": may not change once set"`

calexandre opened this issue · 7 comments

Hey all,
I'm getting this error when provisioning Stunner on AWS + EKS + ALB.

The error seems pretty straightforward. Here's my stunner logs:

2023-10-08T19:36:50.449897717Z  ERROR   updater cannot update service   {"operation": "unchanged", "service": "{\"metadata\":{\"name\":\"udp-gateway\",\"namespace\":\"stunner\",\"creationTimestamp\":null,\"labels\":{\"stunner.l7mp.io/owned-by\":\"stunner\",\"stunner.l7mp.io/related-gateway-name\":\"udp-gateway\",\"stunner.l7mp.io/related-gateway-namespace\":\"stunner\"},\"annotations\":{\"external-dns.alpha.kubernetes.io/hostname\":\"udp.mycompany.com\",\"service.beta.kubernetes.io/aws-load-balancer-nlb-target-type\":\"ip\",\"service.beta.kubernetes.io/aws-load-balancer-scheme\":\"internet-facing\",\"service.beta.kubernetes.io/aws-load-balancer-type\":\"external\",\"stunner.l7mp.io/related-gateway-name\":\"stunner/udp-gateway\"},\"ownerReferences\":[{\"apiVersion\":\"gateway.networking.k8s.io/v1beta1\",\"kind\":\"Gateway\",\"name\":\"udp-gateway\",\"uid\":\"8112c527-66a9-455e-a030-4584d45f203f\"}]},\"spec\":{\"ports\":[{\"name\":\"udp-listener\",\"protocol\":\"UDP\",\"port\":3478,\"targetPort\":0}],\"selector\":{\"app\":\"stunner\"},\"type\":\"LoadBalancer\"},\"status\":{\"loadBalancer\":{}}}", "error": "cannot upsert service \"stunner/udp-gateway\": Service \"udp-gateway\" is invalid: spec.loadBalancerClass: Invalid value: \"null\": may not change once set"}
github.com/l7mp/stunner-gateway-operator/internal/updater.(*Updater).ProcessUpdate
        /workspace/internal/updater/updater.go:115
github.com/l7mp/stunner-gateway-operator/internal/updater.(*Updater).Start.func1
        /workspace/internal/updater/updater.go:62
  • It appears the culprit is on the property loadBalancerClass: service.k8s.aws/nlb present on the LoadBalancer service that gets created based on the GatewayConfig defined below.
  • The service automatically patched by ALB due to the annotations service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip and service.beta.kubernetes.io/aws-load-balancer-type: external

configurations

Here's my GatewayConfig:

apiVersion: stunner.l7mp.io/v1alpha1
kind: GatewayConfig
metadata:
  name: stunner-gatewayconfig
  namespace: ${kubernetes_namespace.stunner.metadata[0].name}
spec:
  realm: stunner.l7mp.io
  authType: plaintext
  userName: "user-1"
  password: "${random_password.stunner_gateway_auth_password.result}"
  loadBalancerServiceAnnotations:
    external-dns.alpha.kubernetes.io/hostname: ${local.udp_gateway_host}.${local.fqdn}
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-type: external

Here's the service that gets provisioned using the above config:

apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: udp.mycompany.com
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-type: external
    stunner.l7mp.io/related-gateway-name: stunner/udp-gateway
  labels:
    stunner.l7mp.io/owned-by: stunner
    stunner.l7mp.io/related-gateway-name: udp-gateway
    stunner.l7mp.io/related-gateway-namespace: stunner
  name: udp-gateway
  namespace: stunner
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.20.21.184
  clusterIPs:
  - 172.20.21.184
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerClass: service.k8s.aws/nlb
  ports:
  - name: udp-listener
    nodePort: 30733
    port: 3478
    protocol: UDP
  selector:
    app: stunner
  type: LoadBalancer

Now, everything appears to be working, but the error is there... I don't know the implications of it.

Hi @calexandre
Thank you for reporting this edge case. The field spec.loadBalancerClass is indeed immutable. And since we do not set the field's value when the gateway operator creates the initial service based on the corresponding Gateway's config, the cloud provider's default value will be used. It seems that after deploying the Service in the cluster, AWS's controller will immediately set this immutable field based on the annotations. Although it should not be modified, they do it anyway. Could not find anything useful based on this online, if you got anything please post it in the comments.

Does this Error get raised only once?
If it works as expected maybe we should leave it as of now and create a note in the docs to raise attention that "AWS....".

@rg0now Maybe in the upcoming releases we could support an additional annotation to set the spec.loadBalancerClass field by default when creating the Service. However, it might be problematic because the user needs to know which `aws-loadbalancer-' combination comes with whatever LoadBalancer class. AND it is just AWS we do not know how other cloud providers handle this.

@davidkornel for now it only affects AWS, but it will change in the future because the spec.loadBalancerClass is part of the service spec from official Kubernetes documentation.

It behaves very much like the spec.ingressClass on the ingress resource.

Now on my end, the stunner error I posted above, is happening on every reconciliation loop (about 1 or 2 times per minute). What are the implications of the error on the stunner side? Without that answer I cannot quantify the severity of the issue...

I asked whether it happens only once or every time to see if it blocks the service update totally or not. If it happens on every reconciliation it blocks the update process totally (I think). As I see from that point we lose control over the service since we cannot update it, however deletion should work. Thus a modification to the Gateway resource should not apply to the service.

I asked whether it happens only once or every time to see if it blocks the service update totally or not. If it happens on every reconciliation it blocks the update process totally (I think). As I see from that point we lose control over the service since we cannot update it, however deletion should work. Thus a modification to the Gateway resource should not apply to the service.

Okay, so if we do not update the gateway, despite the error there should be no problems right?

rg0now commented

Okay, so if we do not update the gateway, despite the error there should be no problems right?

Yes, you're right. But it's still a bug though, thanks a lot for reporting it!

The solution would be to special-case the code that updates Services here:

  • If the Service new Service (which we are about to DeepCopy) contains a nil spec.loadBalancerClass then we should leave the spec.loadBalancerClass on the existing Service intact. This should fix this error.
  • If the new and the existing Service both contain a non-nil spec.loadBalancerClass and the two classes differ, then we should delete the existing Service and recreate it with the new spec.loadBalancerClass (this is a super-intrusive update, public IPs and NodePorts will change, but this should be a relatively rare occurrence).

I'm wondering whether changing controllerutil.CreateOrUpdate to controllerutil.CreateOrPatch would automatically fix this without special-casing. We'll need to do some testing to see this.

Sounds great!
Let me know if you need anything else on my end for reproduction purposes...