kubernetes/kubernetes

externalTrafficPolicy:Local and proxy-mode=ipvs blackholes traffic on nodes without local endpoints

Closed this issue · 46 comments

@kubernetes/sig-network-bugs

What happened:
This is similar/follow on from #71596

When using externalTrafficPolicy:Local and proxy-mode=ipvs, kube-proxy is creating an IPVS entry with no endpoints on nodes that don't host that workload. i.e.

TCP  10.1.109.187:443 rr
$ ip ro get 10.1.109.187
local 10.1.109.187 dev lo table local src 10.1.109.187 
    cache <local> 

This blackholes all traffic from this node (and from pods on this node) trying to reach the load balancer via the external IP.

What you expected to happen:

On nodes where no workload is present, no IPVS entry should be made.

Environment:
kube-proxy version 1.14.0-beta.1

/sig network
/area ipvs

It's been a while since I dug into the kube-proxy code, but IIRC this behaviour is expected. Local-only traffic should be blackholed if there are no endpoints on that node. This is because we have no way (yet) to gauge whether traffic originates from an external or internal source, so if externalTrafficPolicy is set to Local we have to assume that traffic to that service is always routed locally; otherwise we explicitly "drop" rather than let the other side of the connection hang forever.

There's a proposal here that aims to solve this https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0033-service-topology.md

/assign

@andrewsykim Ah yes, I can see here

https://kubernetes.io/docs/tutorials/services/source-ip/

If there are no local endpoints, packets sent to the node are dropped, so you can rely on the correct source IP in any packet processing rules you might apply to a packet that makes it through to the endpoint.

In my case this is frustrating, as it means I am unable to reach a loadBalancer IP with externalTrafficPolicy: Local from within a pod hosted on a node without a local endpoint.

Do you know where in the codebase this blackholing occurs? I could just recompile without this enabled for my use case :)

However, it does seem like an issue that if you have an LB IP on a single node, this means that none of the pods on the other nodes can speak via this LB IP... Is this behaviour more acceptable than allowing misrouted connections to hang?

Do you know where in the codebase this blackholing occurs? I could just recompile without this enabled for my use case :)

My guess would be here, but I would advise against this since you may unexpectedly change other parts of the network by doing so.

However, it does seem like an issue that if you have an LB IP on a single node, this means that none of the pods on the other nodes can speak via this LB IP... Is this behaviour more acceptable than allowing misrouted connections to hang?

There is a lot of context to unravel here to answer this question properly :P but overall yes, because in most cases the connection wouldn't be misrouted. The proxy rules have no way of knowing whether traffic is originating from the local node, another node in the cluster, or an external LB, so we're requiring user input for the proxy to do the right thing here. I think realistically there could be a way for kube-proxy to account for these edge cases, but it would get really messy/complicated. I'm hopeful that https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0033-service-topology.md will address this though :)

Here are 2 options I can think of off the top of my head that might be helpful for now:

  1. Create two services for your deployment, one internal and one external. The internal one would use externalTrafficPolicy: Cluster and you would use its cluster IP. For the external one you would set externalTrafficPolicy: Local and use the LB IP (a sketch follows this list).
  2. Use pod affinity rules to make sure that a deployment that needs to talk to your application is always scheduled on the same node. This has its quirks as well but it may fit your use case.
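
For option 1, a minimal sketch using kubectl (the deployment name "my-app" and the ports are made up for illustration, not part of the original suggestion):

# Internal service: plain ClusterIP, reached from inside the cluster via its cluster IP / DNS name.
kubectl expose deployment my-app --name my-app-internal --port 80 --target-port 8080

# External service: LoadBalancer with externalTrafficPolicy: Local, reached from outside via the LB IP.
kubectl expose deployment my-app --name my-app-external --port 80 --target-port 8080 --type LoadBalancer
kubectl patch service my-app-external -p '{"spec":{"externalTrafficPolicy":"Local"}}'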

Hope that helps!

On nodes where no workload is present, no IPVS entry should be made.

You should probably also remove the loadBalancerIP from the kube-ipvs0 device. The loadBalancerIP would still not be usable on a local node without endpoints but you will probably get an ICMP or RST back instead of nothing (black-hole).

I thought the same applied for proxy-mode=iptables. Is it really different?

On second thought it may be a bad idea to remove the loadBalancerIP from the kube-ipvs0 device. An incoming packet would be routed, probably to the default route and back again, until the TTL runs out.
But if the loadBalancerIP remains on the kube-ipvs0 device and there is no IPVS entry, the packet is handled by the local stack on the node.
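
For anyone wanting to see this on a node, a quick way to inspect the state being discussed (assumes kube-proxy in IPVS mode and ipvsadm installed; the address is the example LB IP from the report above):

# The dummy interface kube-proxy (IPVS mode) binds service and load balancer IPs to.
ip addr show kube-ipvs0

# The virtual server for the LB IP; on a node with no local endpoints it lists no
# real servers, which is the black-hole behaviour described in this issue.
ipvsadm -Ln -t 10.1.109.187:443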

I'm using MetalLB to implement the LoadBalancer type. After reading all these comments, I think that with my setup the best way forward is a simple one-line modification that forces MetalLB to work in "local" mode even when the externalTrafficPolicy is set to Cluster.

/triage support

/close

Closing this for now given the proxy is behaving as expected.

@andrewsykim: Closing this issue.

In response to this:

/close

Closing this for now given the proxy is behaving as expected.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

In the end there was a dirty hack, which was to add this code here

if len(newEndpoints) == 0 && onlyNodeLocalEndpoints {
	for _, epInfo := range proxier.endpointsMap[svcPortName] {
		newEndpoints.Insert(epInfo.String())
	}
}

kube-proxy runs in a "hybrid" mode - on nodes without local active endpoints it sets up an IPVS balancer (like in cluster mode) but on nodes where an active endpoint exists it only uses the local node.

@thockin This is the "behaviour" I mentioned after the Network SIG talk at kubecon

Couldn't kube-proxy just not create the virtual server when the IP is not in the cluster's service IP range, there is no local endpoint, and the traffic policy is Local?
That would at least help if the external load balancer range is different from the k8s service IP range and the pod IP range.

The IPVS-mode kube-proxy behavior also seems to be inconsistent with the iptables mode.
With iptables, using external load balancer IPs and traffic policy Local works just fine for traffic from inside the cluster to the load balancer IPs; at least it does in our setup using MetalLB.

It seems as this was fixed in iptables mode with #77523. Why would this not be reflected in IPVS as well?

@salanki I guess someone needs to open a PR for the equivalent IPVS code change

I think this is trickier for the IPVS case vs iptables. In iptables we use --src-type LOCAL to NAT only if the traffic originates locally. This ensures that traffic from an external source still gets dropped which is required as part of externalTrafficPolicy: Local. This way a node without endpoints fails health checks from an external LB.

On nodes where no workload is present, no IPVS entry should be made.

If we remove IPVS entries entirely, we still have to ensure that traffic via the node port OR the LB IP is actually dropped, otherwise an external load balancer might think a node is healthy and can receive traffic for an endpoint when it doesn't have one. As far as I can tell, there's no trivial way to do this in IPVS mode without layering iptables on top. Maybe this is another instance where we need to supplement IPVS with some iptables rules, but that should be a last resort IMO.
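
For reference, a rough sketch of the kind of rules the iptables proxier ends up with for a Local-policy service that has no local endpoints (the chain names below are placeholders, not the literal generated rules):

# Traffic that originates on the node itself is still sent to the cluster-wide service chain...
iptables -t nat -A KUBE-XLB-EXAMPLE -m addrtype --src-type LOCAL -j KUBE-SVC-EXAMPLE
# ...while everything else (i.e. external traffic) is marked for drop, so an external
# LB's health checks correctly see the node as having no endpoints.
iptables -t nat -A KUBE-XLB-EXAMPLE -j KUBE-MARK-DROP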

kube-proxy runs in a "hybrid" mode - on nodes without local active endpoints it sets up an IPVS balancer (like in cluster mode) but on nodes where an active endpoint exists it only uses the local node.

Sounds like this will be solved with Service Topology - kubernetes/enhancements#536

Agree with @juliantaylor that not creating the VS at all would at least allow BGP routing to do its job and send packets to the proper nodes; or maybe the choice to blackhole should only be an option, not the rule...

Sorry for commenting on a closed issue, but if I understand correctly, when using "externalTrafficPolicy: Local" in our Service resource files, if there are no pods of that service on a node in the cluster, will traffic still get routed to that node and end up dropped?

For example, if I have a multi-node k8s cluster fronted by an internet-facing ELB connected to nginx pods acting as reverse proxies to my services, with path-based routing via the nginx ingress, could I be losing packets on nodes that don't have pods of the destination service? In other words, do I have to have at least one pod per node for each of my services?

I appreciate any answers up front, sorry if it's a dumb question, I'm still relatively new at this.

jbg commented

@Erokos the health checks would fail on the nodes with no pods, so traffic wouldn't be routed to them.

jbg commented

@Erokos the health checks would fail on the nodes with no pods, so traffic wouldn't be routed to them.

We're seeing that in most cloud providers there's a delay (on the order of seconds) between when the health check fails and when traffic is rerouted, so there could be traffic loss there. See #85643 for more details. Currently working on a KEP to improve this.

Does service topology solve this? Will it no longer create blackholing IPVS entries?

Service topology would offer a workaround for this issue ("fall back to addresses in the same zone/region"), but the actual fix is a bit more involved and is proposed here: kubernetes/enhancements#1607.

In the end there was a dirty hack, which was to add this code here

if len(newEndpoints) == 0 && onlyNodeLocalEndpoints {
	for _, epInfo := range proxier.endpointsMap[svcPortName] {
		newEndpoints.Insert(epInfo.String())
	}
}

kube-proxy runs in a "hybrid" mode - on nodes without local active endpoints it sets up an IPVS balancer (like in cluster mode) but on nodes where an active endpoint exists it only uses the local node.

Hello, does anyone see any problem with this solution? It sounds good to me; at this point I'm inclined to just fork kube-proxy and apply this kind of solution.

@malozanoff: Please publish a build to Docker Hub if you do, would be very appreciated.

On second thought it may be a bad idea to remove the loadBalancerIP from the kube-ipvs0 device. An incoming packet would be routed, probably to the default route and back again, until the TTL runs out.
But if the loadBalancerIP remains on the kube-ipvs0 device and there is no IPVS entry, the packet is handled by the local stack on the node.

The router should not route traffic to a node where there is no pod; if it does, it breaks the concept of the Local traffic policy and the load balancer implementation is faulty. But you're right, the TTL here is exactly what avoids a traffic storm if that occurs.

Now the situation is worse... Kubernetes pods can't reach the services the same cluster exposes... In fact, one team might not be able to reach another team's services because they are on the same cluster.

It seems @jpiper's proposal should work just fine.

@malozanoff did you ever build this?

I'm surprised that there is no movement on this issue and that it is still closed. Is wanting to be able to access LoadBalancer services from within the cluster so unusual?

#93456 for LB IP reachability from in-cluster

So there are two separate issues being discussed in this thread I think:

  1. External traffic can be sent to a node that will black-hole the traffic since there are no endpoints; this can happen between load balancer health checks. This particular case of traffic being black-holed would be solved with this KEP. More details in #85643.
  2. The ipvs proxier doesn't differentiate "Local" and "External" traffic for the LB IP, so in-cluster traffic is also dropped. #93456 should address this. The current "workaround" is to use cluster IP for internal cluster traffic.

Will try to make sure these get into v1.20. I think @jpiper's suggestion in #75262 (comment) is almost correct, except it will mean external traffic will not always be sent to "Local" endpoints. We may need to add iptables rules to drop non-local traffic in that specific case so that external traffic sent to a node with no local endpoints is still black-holed.

Funny this thread revived; I was working on this on Friday. I actually built a fork with the changes mentioned above, and I now get endpoints in IPVS for all healthy pods on nodes that don't have a local endpoint (as if it were a ClusterIP), and only the local endpoint if there is one. But strangely enough, it did not solve the issue.
I'm guessing there might be some other issue at play; I'm using MetalLB, so maybe it's causing routing issues with BGP routes... Need more debugging.

In any case, I would LOVE to have a proper upstream solution for this.

If I end up getting the change to work I'll update the ticket.

@malozanoff: Can you push your build to Docker Hub or similar? Would love to try it on my end as a workaround for #93456.

If you're desperate for a working solution now, you're probably better off trying the Service Topology feature as opposed to building your own kube-proxy.

Here's an example Service that prefers node-local endpoints if they exist, otherwise it will fall back to cluster-wide endpoints:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
  topologyKeys:
    - "kubernetes.io/hostname"
    - "*"

Ensure feature gates are enabled, see https://kubernetes.io/docs/tasks/administer-cluster/enabling-service-topology/ for more details.

I do not understand that workaround. Can I slap a serviceTopology like that on a LoadBalancer with externalTrafficPolicy: Local and it will magically fix the black-holing issue?

You have to enable the feature gates, and unfortunately service topology and externalTrafficPolicy are mutually exclusive, so you can't set externalTrafficPolicy when using service topology.

By using the above topologyKeys, the black-hole issue is fixed because of the * fallback on any node that doesn't have endpoints matching the first topology key kubernetes.io/hostname, which refers to the local node. So a node will either proxy to local endpoints (topology kubernetes.io/hostname) if they exist, or otherwise fall back to all cluster-wide endpoints (topology *).

The alternative "workaround" is to use cluster IPs for all internal traffic -- but I understand this isn't always possible or desirable.

But if I can't use externalTrafficPolicy, then from the perspective of where external traffic from the Internet ends up, it's no different from not using externalTrafficPolicy at all. The key use case of externalTrafficPolicy: Local for us is to make sure that ingress traffic from outside the cluster hits a node that has the service running, so it doesn't have to take an extra hop (think multi-Gbps video traffic).

Yup, that would be one of the limitations of using this. You would run into the same problem if you backported the fix mentioned in #75262 (comment) and built your own kube-proxy, though. This was mostly a recommendation for folks who are willing to go out of their way and rebuild kube-proxy with that fix anyway.

The correct fix is to use cluster IP for internal traffic or wait til there's a PR for #93456.

I think #75262 (comment) would work though, as it would still be a LoadBalancer service with externalTrafficPolicy: Local so it would only be announced via MetalLB from the nodes that actually have the endpoints locally. Other nodes would not announce it and would not get external traffic towards it. I might be wrong, but building a kube-proxy and testing it out will prove that if I can't find another solution.

I think #75262 (comment) would work though, as it would still be a LoadBalancer service with externalTrafficPolicy: Local so it would only be announced via MetalLB from the nodes that actually have the endpoints locally.

Ah, good to know!

I might be wrong, but building a kube-proxy and testing it out will prove that if I can't find another solution.

Please do and report back. I will also try to get a PR ready for #93456 this week and share an image to test with here.

I truly appreciate your attention to this.

Yup, that would be one of the limitations of using this. You would run into the same problem if you backported the fix mentioned in #75262 (comment) and built your own kube-proxy, though. This was mostly a recommendation for folks who are willing to go out of their way and rebuild kube-proxy with that fix anyway.

Mmh, I don't see why this would also be a problem in those cases; the nodes with local endpoints are unchanged and the external routes (think BGP) are unrelated to kube-proxy in the first place, so I don't think you'd have this problem with the kube-proxy fix...

The correct fix is to use cluster IP for internal traffic or wait til there's a PR for #93456.

I can confirm that #75262 (comment) solves the issue for me.

ssup2 commented

Hello. To solve this problem, I developed a Kubernetes controller called network-node-manager. It runs on all Kubernetes nodes and adds iptables DNAT rules between the cluster IP and the load balancer IP. Please try it and give me feedback. Thanks.

https://github.com/kakao/network-node-manager
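
As I read it, the idea is that in-cluster traffic aimed at the load balancer IP gets rewritten to the service's cluster IP before it can hit the empty IPVS virtual server. A rough illustrative rule (placeholder addresses, and my reading of the approach rather than the controller's actual implementation):

# DNAT in-cluster traffic destined for the LB IP (example 10.1.109.187 from above)
# to the service's cluster IP (example 10.96.0.50), sidestepping the black-hole.
iptables -t nat -A PREROUTING -d 10.1.109.187/32 -p tcp --dport 443 -j DNAT --to-destination 10.96.0.50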