kubernetes/kubernetes

Why does kube-proxy add the external LB's address to the node-local iptables rules?

BSWANG opened this issue · 62 comments

/kind friction

What happened:
I have a LoadBalancer-type Service A with address 1.1.1.1. The external load balancer for Service A is a TLS terminator: it converts HTTPS requests to plain HTTP before forwarding them to the node hostport and endpoints. But since kube-proxy adds the external LB's address to the node-local iptables rules, requests to https://1.1.1.1 are hijacked to the local HTTP endpoints, and the HTTPS requests fail.

What you expected to happen:
kube-proxy should not add the external LB's address to the local iptables rules, so that requests go out through the external LB.
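
For illustration, this is roughly how the symptom shows up from inside the cluster (a sketch; the port mapping is assumed from the setup above):

```sh
# From a node (or pod) in the cluster, requests to the LB address never leave
# the node: kube-proxy's rules send them straight to the local HTTP endpoints.
curl -v https://1.1.1.1/      # fails: a plain-HTTP pod answers the TLS handshake
curl -v http://1.1.1.1:443/   # "works", showing the traffic was hijacked locally
# The node-local rules responsible can be inspected with:
iptables-save -t nat | grep 1.1.1.1
```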

Environment:

  • Kubernetes version (use kubectl version):
    1.10.4
  • Cloud provider or hardware configuration:
    Alibaba Cloud
  • OS (e.g. from /etc/os-release):
    CentOS 7.4
  • Kernel (e.g. uname -a):
    3.10.0-693
  • Install tools:
    kubeadm

/sig network

I think the reasoning is that traffic originating inside the cluster has no need to go outside (to the LB) and then come back.
And arguably the LB should not be doing TLS termination work.

@Lion-Wei Could this behavior be made configurable? Functions like TLS termination, monitoring, and logging are more reasonable and more performant when done on the LB.
If kube-proxy adds this option, I would be happy to submit a PR for it.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale


We're also seeing this as an issue at DigitalOcean. It's a concern not just for load-balancer TLS termination, but also for supporting the PROXY protocol, as encouraged in https://kubernetes.io/docs/tutorials/services/source-ip/. Traffic addressed to the LB IP from within the cluster never reaches the load balancer, and the required proxy header isn't applied, causing a protocol violation.

The in-tree AWS service type loadbalancer supports proxy protocol and TLS termination, but because it populates status.loadBalancer.ingress.hostname rather than .ip, it avoids this bug/optimization.
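
For illustration, the difference is visible in the Service status (a sketch; the service name is hypothetical):

```sh
# kube-proxy programs rules for ingress entries that carry an "ip" field;
# entries with only a "hostname" (as AWS populates) are never resolved,
# so they escape the optimization.
kubectl get service my-lb-service -o jsonpath='{.status.loadBalancer.ingress}'
# On providers that report an IP:   [{"ip":"1.1.1.1"}]
# On AWS (hostname only):           [{"hostname":"abc123.elb.amazonaws.com"}]
```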

We're willing to put together a PR to address this if there's interest from sig-network to accept it. We've considered a kube-proxy flag to disable the optimization, or the more complex option of extending v1.LoadBalancerIngress to include feedback from the cloud provider.

bowei commented

It seems like what is being implemented here is not modeled all that well in the current kube-proxy construct; your LB is not transparent from an L3/4 perspective and lives somewhere other than the node (it is terminating TLS, adding a proxy protocol field). It seems possible for nothing to break if we remove the external LB IPs from the node as a provider-specific setting, but it will require some thought.

If I'm being quite honest, this behavior goes against the principle of least surprise. If I wanted kube-proxy to route my traffic, I would use a ClusterIP service or similar and configure the application to use it; if I am hitting a LoadBalancer provided by a cloud provider, there is likely a specific reason for that, and kube-proxy shouldn't try to second-guess it.

johnl commented

We hit this same problem at Brightbox with our cloud controller manager. Our load balancers can terminate SSL and support the PROXY protocol. Confirmed with both IPVS and iptables kube-proxy modes: any IPs we provide as LoadBalancerIngress are used to redirect outgoing connections and keep them within the cluster, breaking things.

To put it more clearly (imo), the problem is that kube-proxy assumes all load balancer services are straight TCP/UDP and that it can optimize them internally, but that is not the case.

So either the assumption is wrong and kube-proxy needs fixing, or the assumption is right and external load balancers should not be doing these fancy things. I obviously think kube-proxy is wrong here :)

As @jcodybaker says above, AWS avoids this problem simply by not listing IP addresses as LoadBalancerIngress, only hostnames, which kube-proxy doesn't resolve.

We've changed our cloud controller to do the same, listing only hostnames in there, and it's fixed our problem too.

This feels a bit like a hack, but maybe it could be argued that's the way for a cloud controller to enable/disable the optimisation, though that would be a bit misleading.

We were also exposing the IPv6 addresses of our load balancers, which didn't look properly supported further down the stack at all, so things do seem to be a bit of a mess here generally.

So a kube-proxy flag to disable the optimization would be a great start, but I think this all needs a closer look!

Plus, it's worth noting that in IPVS mode, the IP address is actually added locally to each node, so it's not possible to selectively support this per port. As soon as a load balancer lists the IP, it's completely unreachable directly.
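
You can see this on any node; a rough sketch (kube-proxy's IPVS implementation binds the addresses to its kube-ipvs0 dummy interface):

```sh
# In IPVS mode kube-proxy binds every service/LB IP to a dummy interface,
# so the node itself claims the address outright:
ip addr show dev kube-ipvs0
#   inet 1.1.1.1/32 scope global kube-ipvs0    <- illustrative output
# The virtual services kube-proxy created can be listed with:
ipvsadm -ln
```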

If I create an external loadbalancer I don't expect kube-proxy to hijack traffic addressed to that external loadbalancer.

That completely disregards any functionality the load balancer provides, including TLS termination, PROXY protocol, logging, metrics, DPI firewalling, and... the actual load balancing algorithm!

I'm perplexed that this is the default behavior. It only seems useful if your external load balancer doesn't actually do anything useful 😄

As @jcodybaker says above, AWS avoids this problem simply by not listing IP addresses as LoadBalancerIngress, only hostnames, which kube-proxy doesn't resolve. I'd like to be able to lock this out in kube-proxy in general for cluster deployments. This behavior is only going to break things.

Thanks for raising this @BSWANG. This submerged iceberg could have spoiled my day someday, but now I know to watch out for it 👀

This issue was discussed a few weeks ago at a SIG Networking meeting. The problem was acknowledged, and the SIG is open to reviewing a PR that demonstrates what a possible fix could look like. (IIRC, @thockin said he believed he was explicitly trying to prevent the kind of LB-bypassing behavior that can be observed right now when he originally built out kube-proxy.)

We (DigitalOcean) have volunteered to submit a PR but probably won't get to it prior to KubeCon EU. If anyone feels like attacking this earlier, feel free to (and please report back here to let us know of your intent).

That's great @timoreimann. If it wasn't intentional, then hopefully the PR can change the current default behavior. But for now we (1) know about it, and (2) have a workaround, so all good.

Feel free to send me your KubeCon tickets, if you want more time to stay and work on a patch 😉

I've started working on a fix for this in #65387, which I think is related? (Please correct me if I'm wrong.)

/assign

I have a potential fix for this here: #77523. The caveat is that internal traffic will have to go through SNAT, but this is okay since source IP preservation only applies to external traffic anyway. The patch needs more validation/testing.

Thank you for your work @andrewsykim! An example use case for your patch is cert-manager, which accesses the external load balancer to verify the ACME HTTP01 auth file. There has been an open cert-manager issue for over a year, as we had no idea why cert-manager was broken for just some users. It looks like this behavior was the culprit!

Now that #77523 is merged, should this issue be closed? (See this comment)

I made a comment about this in the wrong issue: #80297 (comment).

Note that the reverse is also wanted by other users, see #69845.


foux commented

Any news on this one?


Don't close this issue.
For kube-proxy in iptables mode, we can simply add iptables -t nat -I PREROUTING -d $LB_IP -j RETURN to work around this.
But for kube-proxy in IPVS mode, we have no idea how to stop kube-proxy from hijacking traffic addressed to the external load balancer.
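
Spelled out as a sketch (the PREROUTING rule is the one quoted above; the OUTPUT variant is an assumption for host-originated traffic, and rules may need re-applying if the nat table is flushed):

```sh
# iptables mode only: make the kernel skip kube-proxy's NAT rules for traffic
# addressed to the external LB, so it actually leaves the node.
LB_IP=1.1.1.1   # substitute your load balancer's address
iptables -t nat -I PREROUTING -d "$LB_IP" -j RETURN
# Traffic generated on the node itself traverses OUTPUT rather than PREROUTING,
# so hairpin traffic from host processes may also need:
iptables -t nat -I OUTPUT -d "$LB_IP" -j RETURN
```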

@timoreimann , looking forward to your fix.


Is there a workaround for this problem? (I couldn't find one on Google.) It's a real bummer that I can't access my own domain name from the services that provide content for that domain. I'd like to access other services/pods by using the domain name, because the project works this way, and it's weird that it doesn't work out of the box.

For the record, I'm using a DigitalOcean external LoadBalancer with Ingress for my subdomain routing. I've seen the do-loadbalancer-hostname annotation, but this fixes the problem per subdomain, which would require a load balancer per subdomain, which isn't optimal.

As far as I know there is no clean workaround. @bartversluijs I came to the same conclusion: I don't want a load balancer per subdomain simply to hit my own site.

@jekakm That can work if you have a small number of sites, but it's not that great if you have hundreds or thousands of hosts.

@jekakm Yeah, I've mentioned that possibility, but I have multiple domains in my Ingress, and nginx-ingress handles my LoadBalancer. If I have to set up multiple LoadBalancers for all my subdomains with that option, and add an nginx pod to handle the multiple subdomains, it's going to be an expensive mess. Besides that, it's weird that it's even needed IMHO.

By the way, using the hostname instead of the IP address has an unfortunate side-effect for us.

We set up our Kubernetes clusters (on DigitalOcean) using Terraform. However, when using the hostname, the IP address is not available. So we cannot use it in our Terraform DNS resources to have the domains pointed to the cluster load balancer.

We work around that by using the DO load balancer REST API as a data source, but apart from being ugly and requiring extra credentials, this means the IP address is not available during the Terraform plan phase and it never knows whether those resources will change.

@bartversluijs @martinlevesque

Unless I'm missing something about your issue/requirements (entirely possible), do-loadbalancer-hostname fixes the issue per loadbalancer, not per subdomain. You shouldn't have to change your loadbalancer/ingress structure at all.

The DO documentation is rather confusing on this point, but it doesn't actually matter what hostname you provide for do-loadbalancer-hostname. It's entirely unrelated to whatever domains/subdomains you want to point at the loadbalancer and route with Ingress.

As long as the hostname provided resolves to the loadbalancer IP, the loadbalancer/k8s service/ingress will work, and it triggers the side effect of not adding a local iptables route for the loadbalancer IP, fixing the routing issue.

I run multiple subdomains - and multiple entirely separate domains - through a single DO loadbalancer with ingress, using a single do-loadbalancer-hostname entry. The hostname I use is a subdomain I created specifically for this purpose, but it could be anything I control DNS for.

@braedon So correct me if I'm wrong, but I can put, for example, ingress.mydomain.com as the hostname, and from there on I can resolve every domain I've put in my Ingress file, even if it's test.mydomain.com, because kube-proxy doesn't add the iptables rules?

This would probably solve my problem. The only thing is, does it work if I add the do-loadbalancer-hostname annotation to the Ingress file? Or how am I supposed to add this annotation to the LoadBalancer file while using an Ingress Controller that automatically creates a LoadBalancer?

@bartversluijs Yip, adding do-loadbalancer-hostname: "ingress.mydomain.com" works by causing k8s not to add the problematic iptables rule, so it fixes internal routing for any domain/subdomain, no matter which one is used to reach it. Once it's in place, any internal traffic to the loadbalancer external IP (regardless of the hostname that resolved to it) will be correctly routed to the loadbalancer external IP, rather than shortcutting to the ingress service IP.

Actually setting up ingress.mydomain.com initially is a little tricky, as you note. The DO documentation recommends getting the ingress controller set up without do-loadbalancer-hostname first. Once the load balancer is up, add the DNS entry pointing to its IP address, and then update the controller's Service to add do-loadbalancer-hostname. This worked fine for me when I did it manually, though it may be tricky to implement with devops tools...
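
Roughly, the bootstrap sequence implied above (a sketch; the service name is hypothetical):

```sh
# 1. Deploy the ingress controller without the annotation, then note the LB IP:
kubectl get service nginx-ingress-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
# 2. Create a DNS A record (e.g. ingress.mydomain.com -> that IP), then:
kubectl annotate service nginx-ingress-controller \
  "service.beta.kubernetes.io/do-loadbalancer-hostname=ingress.mydomain.com"
```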

@braedon I'm trying this at the moment. It's definitely something to keep in mind that you have to add it after the LoadBalancer has been deployed and then update it. But my only problem is how to add it to my Ingress controller, because the Ingress Controller creates a LoadBalancer internally (using Helm), so I can't change it. So I'm figuring out whether it's possible to add the annotation to the Ingress.yml file. Or is this impossible?

@bartversluijs Ahh, I see. I'm pretty sure the annotation can only be on the LoadBalancer Service itself - it changes how DO configures the loadbalancer, so isn't specific to any ingress rule.

What helm chart are you using? It should have an option to set the service annotations. e.g. it looks like the stable/nginx-ingress chart has a controller.service.annotations object.

Adding this to the Ingress isn't possible, so you have to apply your own values.yaml to the Ingress controller you're using.
Look for (probably) service:; the annotations will be empty. If you add service.beta.kubernetes.io/do-loadbalancer-hostname: "YOUR_HOSTNAME" as an annotation to the service, it works!
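
For example, a sketch against the stable/nginx-ingress chart mentioned above (release name, file name, and hostname are placeholders):

```sh
# Values override using the chart's controller.service.annotations key; the
# hostname just has to be a name you control DNS for, pointed at the LB's IP.
cat > lb-hostname-values.yaml <<'EOF'
controller:
  service:
    annotations:
      service.beta.kubernetes.io/do-loadbalancer-hostname: "ingress.mydomain.com"
EOF
helm upgrade nginx-ingress stable/nginx-ingress -f lb-hostname-values.yaml
```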

Thanks @braedon for commenting. I had given up on this until you came along!

FYI there's progress happening in this KEP kubernetes/enhancements#1392


The hostname workaround doesn't work if you're creating DNS records with external-dns:
external-dns picks up the load balancer hostname instead of the IP, and since it's not an A record, it just doesn't create it.
Any workaround for this case?

I'm using coredns, what's the solution to this?

and MetalLB + Traefik, with Pi-hole to store my DNS routes

Just to let everybody know who's looking at this issue and might think it got fixed because the issue shows as "Closed":
the PR that fixed this was actually reverted later in #96454.

The problem is still there in Kubernetes v1.20.
https://github.com/compumike/hairpin-proxy can help as a workaround.

hairpin-proxy works, but it had unexpected consequences in my use case: cert-manager/cert-manager#3749

That's only an issue if you need both dns01 and http01 challenges to work, right?
As I understand it, dns01 challenges should not need the hairpin-proxy workaround at all.

/reopen

@BSWANG: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BSWANG: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


Reopening because PR #96454 was reverted.

aojea commented

Reopening because PR #96454 was reverted.

The revert was done because that PR merged accidentally. To avoid misunderstandings: it's not that the project doesn't want to implement this; it's that there is a process for adding new features, and that PR didn't follow it.
The KEP to implement the change has been approved https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1860-kube-proxy-IP-node-binding

@Sh4d1 are you still working on this in #97681? If you want to include it in 1.22, note that the process has changed; please see https://groups.google.com/g/kubernetes-sig-network/c/3NYVEgvE1Uc/m/o68JAIt0DAAJ for more details.

Sh4d1 commented

@aojea Yep! I laid the groundwork in #101027, but I was waiting on Tim's review, though I understand he's been a bit busy! (You're welcome to review it if you want!)
Once that one is merged, I can rebase #97681 and it should be good.

Thanks, I'll read that 😄

Commenting for support/+1: self-referencing the cluster from within would be useful.

@mkreidenweis-schulmngr's hairpin solution (kind of) worked, but it didn't pick up the correct name of the pod(s) to reroute to; it probably would have worked with further digging.
Setting up a separate cluster solved the issue in the meantime.

Thanks for your work on this

#101027 was not assigned to me, so I missed it.

I'm going to close this as a dup, since we do have the KEP open

How is the progress?

Is there any solution for this?

Hey @thockin 👋

I'd be happy to help push this over the finish line. Is there a place where I can refresh my memory on where we stand right now and what remains to be done?

I vaguely remember that time was spent on a KEP but don't recall the details.

Appreciate any starting pointers.

Sh4d1 commented

@timoreimann 👋 you can read kubernetes/enhancements#1860 (comment);
Tim opened #106242 instead of #101027.

So if nothing really changed, my last message still stands I guess!