kubernetes/kubernetes

[ipvs] in-cluster traffic for loadbalancer IP with externalTrafficPolicy=Local should use cluster-wide endpoints

Closed this issue · 24 comments

What would you like to be added:
In-cluster traffic for a Service loadbalancer IP with externalTrafficPolicy=Local should be masqueraded and routed to cluster-wide endpoints. A similar patch was added to the iptables proxier in v1.15 in #77523.

Why is this needed:
Today, the ipvs proxier does not differentiate LB IP traffic between in-cluster ("Local") and external sources, so when a Service uses externalTrafficPolicy=Local and the traffic originates in-cluster, it is black-holed on nodes that have no local endpoints.
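
For reference, the scenario can be reproduced with something like the following (a sketch; the Service name my-svc, the client pod name and the LB IP are made-up examples, and the client image is assumed to have curl):

$ kubectl patch svc my-svc -p '{"spec":{"externalTrafficPolicy":"Local"}}'
$ LB_IP=$(kubectl get svc my-svc -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# From a pod scheduled on a node that runs no endpoints of my-svc, the request
# hangs under the ipvs proxier instead of being answered by a remote endpoint:
$ kubectl exec client-pod -- curl -m 5 "http://$LB_IP/"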

/assign
/sig network

/area ipvs

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

/remove-triage unresolved

Patching kube-proxy with #75262 (comment) solves this issue.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

I tried the proposed fix in #75262 (comment).

From within a POD on a node where no server PODs are executing, it works fine, with the source (POD) address preserved.

From an external machine, when incoming packets hit a node without server PODs (where traffic usually is black-holed), the packets go like this:

external-net -> k8s-node without server -> k8s-node with server -> server-POD -> k8s-node with server -> external-net

On the k8s-node without a server, DNAT happens (VIP -> POD address). The problem is that the reply takes a short-cut (much like DSR), but its source is now the POD address, where it should have been NAT'ed back to the VIP. The reply travels all the way back to the sender, which knows of no connection with the POD address as source and drops it.

But if the external traffic setup is correct, external traffic should never hit a node where no server PODs are executing, so IMHO this fix is OK.
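
One way to observe this asymmetry (a sketch, assuming the conntrack tool is installed on the node): the node without server PODs holds the DNAT state, but because the reply short-cuts past it, the entry never sees the return packet.

# On the k8s-node without server PODs; 10.0.0.90 is the VIP from this example.
$ conntrack -L -d 10.0.0.90
# The entry is expected to show the translation to the POD address (11.0.4.3)
# and remain [UNREPLIED], since the SYN-ACK never traverses this node.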

Here is a trace from an external source, showing the SYN going to the VIP address (10.0.0.90) while the SYN-ACK has the POD address (11.0.4.3) as source:

$ tcpdump -ni eth1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:27:42.244083 IP 192.168.1.201.57116 > 10.0.0.90.80: Flags [S], seq 3799401641, win 64240, options [mss 1460,sackOK,TS val 1840255082 ecr 0,nop,wscale 6], length 0
16:27:42.245033 IP 11.0.4.3.80 > 192.168.1.201.57116: Flags [S.], seq 1075405856, ack 3799401642, win 65160, options [mss 1460,sackOK,TS val 294525933 ecr 1840255082,nop,wscale 7], length 0

Traffic from the main netns (e.g. hostNetwork: true) on a k8s-node without a server fails for a different reason than traffic from an external machine.

Traffic from the main netns on a k8s-node will get the VIP as source! This is because assigning the address to dev kube-ipvs0 sets up a local route like this:

$ ip ro show table local
local 10.0.0.90 dev kube-ipvs0 proto kernel scope host src 10.0.0.90
...

The SYN is sent OK (e.g. 10.0.0.90:36142 -> 10.0.0.90:80), but the reply to the VIP address fails.

This can be "fixed" by changing the route to set the node IP as src:

$ ip ro replace 10.0.0.90 dev kube-ipvs0 src 192.168.1.2 table local

Then it works.
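
A quick way to check which source address the kernel picks for the VIP (a sketch; the addresses are the ones from the example above):

$ ip route get 10.0.0.90
# Before the replace this shows "... src 10.0.0.90 ...", i.e. the VIP itself;
# after the replace it shows "... src 192.168.1.2 ...", so locally originated
# traffic to the VIP is sourced from the node IP and the replies can be matched.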

@andrewsykim Is it possible to add the proposed solution from #75262 (comment) to kube-proxy?

It would solve the problem for all users except PODs with hostNetwork: true, and it has no significant drawbacks as far as I can tell.

@uablrek I will take a look at that PR soon

@jpiper Would you mind if I created a PR with your update described in #75262 (comment)?

Or would you prefer to do it yourself?

ssup2 commented

Hello. To solve this problem, I developed a Kubernetes controller called network-node-manager. network-node-manager runs on all Kubernetes nodes and adds iptables DNAT rules between the cluster IP and the load balancer IP. Please try it and give me feedback. Thanks.

https://github.com/kakao/network-node-manager
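
For illustration only (a sketch, not necessarily the exact rules this controller installs), a DNAT rule of this general shape would redirect traffic that targets the load balancer IP onto the Service's cluster IP; both addresses below are made-up examples:

$ LB_IP=10.0.0.90
$ CLUSTER_IP=10.96.55.12
# Packets from pods (PREROUTING) and from the host itself (OUTPUT) that are
# addressed to the LB IP get rewritten to the cluster IP instead:
$ iptables -t nat -A PREROUTING -d "$LB_IP" -j DNAT --to-destination "$CLUSTER_IP"
$ iptables -t nat -A OUTPUT -d "$LB_IP" -j DNAT --to-destination "$CLUSTER_IP"

With rules like these, in-cluster clients that target the LB IP effectively take the ClusterIP path, which load-balances over cluster-wide endpoints.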

The CI test for this is temporarily disabled:

kubernetes/test-infra#21447

It should be re-enabled when this issue is solved.

/area kube-proxy

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

/remove-lifecycle stale

There is a pending PR #97081

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jc2k commented

/remove-lifecycle stale

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale


@uablrek Hello.

Please tell me: did you manage to reach the external IP from the host node without removing or replacing the route, and with the correct source IP?

I have the same problem.

When accessed from a POD, it works fine.

But when accessing the external IP from the host node, it does not work.

Replacing or removing the route for the external IP on the kube-ipvs0 interface helps.

Is there any working solution for this?

@nightguide No, sorry. I think the route must be replaced to make it work, as described in #93456 (comment). But that is a much more intrusive (i.e. dangerous) update than the tiny fix in #97081.
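
For completeness, the route workaround generalized to every VIP on kube-ipvs0 might look roughly like this (a sketch only; the node IP is an assumption, and kube-proxy can re-create the original routes at any time, which is part of what makes this intrusive):

$ NODE_IP=192.168.1.2
$ for vip in $(ip -4 addr show dev kube-ipvs0 | awk '/inet /{print $2}' | cut -d/ -f1); do
>   ip route replace "$vip" dev kube-ipvs0 src "$NODE_IP" table local
> done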