Stunner gateway operator `ERROR updater cannot update service` on AWS + EKS + ALB due to `"cannot upsert service \"stunner/udp-gateway\": Service \"udp-gateway\" is invalid: spec.loadBalancerClass: Invalid value: \"null\": may not change once set"`
calexandre opened this issue · 7 comments
Hey all,
I'm getting this error when provisioning Stunner on AWS + EKS + ALB.
The error seems pretty straightforward. Here's my stunner
logs:
2023-10-08T19:36:50.449897717Z ERROR updater cannot update service {"operation": "unchanged", "service": "{\"metadata\":{\"name\":\"udp-gateway\",\"namespace\":\"stunner\",\"creationTimestamp\":null,\"labels\":{\"stunner.l7mp.io/owned-by\":\"stunner\",\"stunner.l7mp.io/related-gateway-name\":\"udp-gateway\",\"stunner.l7mp.io/related-gateway-namespace\":\"stunner\"},\"annotations\":{\"external-dns.alpha.kubernetes.io/hostname\":\"udp.mycompany.com\",\"service.beta.kubernetes.io/aws-load-balancer-nlb-target-type\":\"ip\",\"service.beta.kubernetes.io/aws-load-balancer-scheme\":\"internet-facing\",\"service.beta.kubernetes.io/aws-load-balancer-type\":\"external\",\"stunner.l7mp.io/related-gateway-name\":\"stunner/udp-gateway\"},\"ownerReferences\":[{\"apiVersion\":\"gateway.networking.k8s.io/v1beta1\",\"kind\":\"Gateway\",\"name\":\"udp-gateway\",\"uid\":\"8112c527-66a9-455e-a030-4584d45f203f\"}]},\"spec\":{\"ports\":[{\"name\":\"udp-listener\",\"protocol\":\"UDP\",\"port\":3478,\"targetPort\":0}],\"selector\":{\"app\":\"stunner\"},\"type\":\"LoadBalancer\"},\"status\":{\"loadBalancer\":{}}}", "error": "cannot upsert service \"stunner/udp-gateway\": Service \"udp-gateway\" is invalid: spec.loadBalancerClass: Invalid value: \"null\": may not change once set"}
github.com/l7mp/stunner-gateway-operator/internal/updater.(*Updater).ProcessUpdate
/workspace/internal/updater/updater.go:115
github.com/l7mp/stunner-gateway-operator/internal/updater.(*Updater).Start.func1
/workspace/internal/updater/updater.go:62
- It appears the culprit is on the property
loadBalancerClass: service.k8s.aws/nlb
present on theLoadBalancer
service that gets created based on theGatewayConfig
defined below. - The service automatically patched by ALB due to the annotations
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
andservice.beta.kubernetes.io/aws-load-balancer-type: external
configurations
Here's my GatewayConfig
:
apiVersion: stunner.l7mp.io/v1alpha1
kind: GatewayConfig
metadata:
name: stunner-gatewayconfig
namespace: ${kubernetes_namespace.stunner.metadata[0].name}
spec:
realm: stunner.l7mp.io
authType: plaintext
userName: "user-1"
password: "${random_password.stunner_gateway_auth_password.result}"
loadBalancerServiceAnnotations:
external-dns.alpha.kubernetes.io/hostname: ${local.udp_gateway_host}.${local.fqdn}
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-type: external
Here's the service that gets provisioned using the above config:
apiVersion: v1
kind: Service
metadata:
annotations:
external-dns.alpha.kubernetes.io/hostname: udp.mycompany.com
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
service.beta.kubernetes.io/aws-load-balancer-type: external
stunner.l7mp.io/related-gateway-name: stunner/udp-gateway
labels:
stunner.l7mp.io/owned-by: stunner
stunner.l7mp.io/related-gateway-name: udp-gateway
stunner.l7mp.io/related-gateway-namespace: stunner
name: udp-gateway
namespace: stunner
spec:
allocateLoadBalancerNodePorts: true
clusterIP: 172.20.21.184
clusterIPs:
- 172.20.21.184
internalTrafficPolicy: Cluster
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
loadBalancerClass: service.k8s.aws/nlb
ports:
- name: udp-listener
nodePort: 30733
port: 3478
protocol: UDP
selector:
app: stunner
type: LoadBalancer
Now, everything appears to be working, but the error is there... I don't know the implications of it.
Hi @calexandre
Thank you for reporting this edge case. The field spec.loadBalancerClass
is indeed immutable. And since we do not set the field's value when the gateway operator creates the initial service based on the corresponding Gateway's config, the cloud provider's default value will be used. It seems that after deploying the Service in the cluster, AWS's controller will immediately set this immutable field based on the annotations. Although it should not be modified, they do it anyway. Could not find anything useful based on this online, if you got anything please post it in the comments.
Does this Error get raised only once?
If it works as expected maybe we should leave it as of now and create a note in the docs to raise attention that "AWS....".
@rg0now Maybe in the upcoming releases we could support an additional annotation to set the spec.loadBalancerClass
field by default when creating the Service. However, it might be problematic because the user needs to know which `aws-loadbalancer-' combination comes with whatever LoadBalancer class. AND it is just AWS we do not know how other cloud providers handle this.
@davidkornel for now it only affects AWS, but it will change in the future because the spec.loadBalancerClass
is part of the service
spec from official Kubernetes documentation.
It behaves very much like the spec.ingressClass
on the ingress
resource.
Now on my end, the stunner
error I posted above, is happening on every reconciliation loop (about 1 or 2 times per minute). What are the implications of the error on the stunner
side? Without that answer I cannot quantify the severity of the issue...
I asked whether it happens only once or every time to see if it blocks the service update totally or not. If it happens on every reconciliation it blocks the update process totally (I think). As I see from that point we lose control over the service since we cannot update it, however deletion should work. Thus a modification to the Gateway resource should not apply to the service.
I asked whether it happens only once or every time to see if it blocks the service update totally or not. If it happens on every reconciliation it blocks the update process totally (I think). As I see from that point we lose control over the service since we cannot update it, however deletion should work. Thus a modification to the Gateway resource should not apply to the service.
Okay, so if we do not update the gateway, despite the error there should be no problems right?
Okay, so if we do not update the gateway, despite the error there should be no problems right?
Yes, you're right. But it's still a bug though, thanks a lot for reporting it!
The solution would be to special-case the code that updates Services here:
- If the Service new Service (which we are about to DeepCopy) contains a nil
spec.loadBalancerClass
then we should leave thespec.loadBalancerClass
on the existing Service intact. This should fix this error. - If the new and the existing Service both contain a non-nil
spec.loadBalancerClass
and the two classes differ, then we should delete the existing Service and recreate it with the newspec.loadBalancerClass
(this is a super-intrusive update, public IPs and NodePorts will change, but this should be a relatively rare occurrence).
I'm wondering whether changing controllerutil.CreateOrUpdate
to controllerutil.CreateOrPatch
would automatically fix this without special-casing. We'll need to do some testing to see this.
Sounds great!
Let me know if you need anything else on my end for reproduction purposes...