kubernetes/kubernetes

kube-proxy currently incompatible with `iptables >= 1.8`

drags opened this issue · 82 comments

drags commented

What happened:

When creating nodes on machines with iptables >= 1.8, kube-proxy is unable to initialize and route service traffic. The following is logged:

kube-proxy-22hmk kube-proxy E1120 07:08:50.135017       1 proxier.go:647] Failed to ensure that nat chain KUBE-SERVICES exists: error creating chain "KUBE-SERVICES": exit status 3: iptables v1.6.0: can't initialize iptables table `nat': Table does not exist (do you need to insmod?)
kube-proxy-22hmk kube-proxy Perhaps iptables or your kernel needs to be upgraded.

This is a compatibility issue in iptables, which I believe is called directly from kube-proxy. It is likely due to the module reorganization that came with iptables' move to nf_tables: https://marc.info/?l=netfilter&m=154028964211233&w=2

iptables 1.8 in a container is backwards compatible with an iptables 1.6 host, but an iptables 1.6 container breaks on an iptables 1.8 (nf_tables) host:

root@vm77:~# iptables --version
iptables v1.6.1
root@vm77:~# docker run --cap-add=NET_ADMIN drags/iptables:1.6 iptables -t nat -Ln
iptables: No chain/target/match by that name.
root@vm77:~# docker run --cap-add=NET_ADMIN drags/iptables:1.8 iptables -t nat -Ln
iptables: No chain/target/match by that name.



root@vm83:~# iptables --version
iptables v1.8.1 (nf_tables)
root@vm83:~# docker run --cap-add=NET_ADMIN drags/iptables:1.6 iptables -t nat -Ln
iptables v1.6.0: can't initialize iptables table `nat': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
root@vm83:~# docker run --cap-add=NET_ADMIN drags/iptables:1.8 iptables -t nat -Ln
iptables: No chain/target/match by that name.

However, the kube-proxy image is based on debian:stretch, and iptables 1.8 may only reach stretch as part of stretch-backports.

How to reproduce it (as minimally and precisely as possible):

Install a node onto a host with iptables-1.8 installed (ex: Debian Testing/Buster)

Anything else we need to know?:

I can keep these nodes in this config for a while, feel free to ask for any helpful output.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:54:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.4", GitCommit:"bf9a868e8ea3d3a8fa53cbb22f566771b3f8068b", GitTreeState:"clean", BuildDate:"2018-10-25T19:06:30Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:

libvirt

  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux buster/sid"
NAME="Debian GNU/Linux"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux vm28 4.16.0-1-amd64 #1 SMP Debian 4.16.5-1 (2018-04-29) x86_64 GNU/Linux
  • Install tools:

kubeadm

  • Others:

/kind bug

drags commented

/sig network

drags commented

@kubernetes/sig-network-bugs


For the record, this probably breaks at least Calico and Weave as well, based on my abject failures to get pod<>pod networking to function on Debian Buster (which has upgraded to iptables 1.8). I'm filing bugs for that now, but this breaking change to iptables may be worth a wider broadcast to the k8s community.

kube-proxy itself seems compatible with iptables >= 1.8, so the title of this issue is somewhat misleading. I have run basic tests and see no problems when using the correct version of the user-space iptables (and ip6tables for IPv6) and the supporting libs. I don't think this problem can be fixed by altering some code in kube-proxy.

Tested versions: iptables v1.8.2, Linux 4.19.3

The problem seems to be that the iptables user-space program (and libs) is, and has always been, dependent on the kernel version on the host. When the iptables user-space program in a container is an old version, this problem is bound to happen sooner or later, and it will happen again.

The kernel/user-space dependency is one of the problems that nft is supposed to fix. A long-term solution may be to replace iptables with nft or bpf.

iptables v1.8.2 has two modes (depending on symlinks):

# iptables -V
iptables v1.8.2 (nf_tables)

and;

# iptables -V
iptables v1.8.2 (legacy)

kube-proxy seems to work fine with both.

BTW, I have not tested any network policies; that is not kube-proxy of course, but it is iptables.

drags commented

While the title is somewhat murky, the fact is that kube-proxy is distributed using images based on debian-stretch and pulls in the iptables userspace from that distribution. When those images are run on hosts with a newer iptables, this fails.

To be clear: this isn't a defect in the code, it's a defect in packaging/release.

kube-proxy is distributed using images based on debian-stretch and pulls in the iptables userspace from that distribution. When those images are run on hosts with a newer iptables this fails

Do you mean it breaks on a newer kernel? The iptables binary is part of kube-proxy so what would the on-host iptables have to do with anything?

I don't understand.

There are 2 sets of modules for packet filtering in the kernel: ip_tables, and nf_tables. Until recently, you controlled the ip_tables ruleset with the iptables family of tools, and nf_tables with the nft tools.

In iptables 1.8, the maintainers have "deprecated" the classic ip_tables: the iptables tool now does userspace translation from the legacy UI/UX, and uses nf_tables under the hood. So, the commands look and feel the same, but they're now programming a different kernel subsystem.

The problem arises when you mix and match invocations of iptables 1.6 (the previous stable) and 1.8 on the same machine, because although they look identical, they're programming different kernel subsystems. In practice, at least Docker does some stuff with iptables on the host (uncontained), and so you end up with some rules in nf_tables and some rules (including those programmed by kube-proxy and most CNI addons) in legacy ip_tables.

Empirically, this causes weird and wonderful things to happen - things like if you trace a packet coming from a pod, you see it flowing through both ip_tables and nf_tables, but even if both accept the packet, it then vanishes entirely and never gets forwarded (this is the failure mode I reported to Calico and Weave - bug links upthread - after trying to run k8s on debian testing, which now has iptables 1.8 on the host).

Bottom line, the networking containers on a machine have to be using the same minor version of the iptables binary as exists on the host.
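
A rough way to see whether a machine has ended up in that split state (just a diagnostic sketch; it assumes an iptables >= 1.8 install is available somewhere, since only 1.8 ships both backends):

# rule lines in iptables-*-save output start with "-"; count them per backend
iptables-legacy-save 2>/dev/null | grep -c '^-'
iptables-nft-save 2>/dev/null | grep -c '^-'
# if both counts are non-zero, rules are split across ip_tables and nf_tables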

@danderson Do you think it would be sufficient to force (if possible) the host version of iptables into "legacy" mode:

# iptables -V
iptables v1.8.2 (legacy)

and keep the >=1.8 version?

I build and install iptables myself, and the "mode" is determined by a symlink:

# ls -l /usr/sbin/iptables
lrwxrwxrwx    1 root     root            20 Dec 18 08:47 /usr/sbin/iptables -> xtables-legacy-multi*

I assume the same applies for "Debian Testing/Buster" and others, but I don't know for sure.

@danderson thanks. That was very succinct.

What a crappy situation. How are we to know what is on the host? Can we include BOTH binaries in our images and probe the machine to see if either has been used previously (e.g. lsmod or something in /sys) ?

As a preface, one thing to note: iptables 1.8 ships two binaries, iptables and iptables-legacy. The latter always programs ip_tables. So, there's fortunately no need to bundle two versions of iptables into a container, you can bundle just iptables 1.8 and be judicious about which binary you invoke... At least until the -legacy binary gets deleted, presumably in a future release.

Here's some requirements I think an ideal solution would have:

  • k8s networking must continue to function, obviously.
  • should be robust to the host iptables getting upgraded while the system is running (e.g. apt-get upgrade in the background).
  • should be robust to other k8s pods (e.g. CNI addons) using the "wrong" version of iptables.
  • should be invisible to cluster operators - k8s should just keep working throughout.
  • should not require a "flag day" on which everything must cut over simultaneously. There's too many things in k8s that touch iptables (docker, kube-proxy, CNI addons) to enforce that sanely, and k8s's eventual consistency model doesn't make a hard cutover without downtime possible anyway.
  • at the very least, the problem should be detected and surfaced as a fatal node misconfiguration, so that any automatic cluster healing can attempt to help.

So far I've only thought up crappy options for dealing with this. I'll throw them out in the hopes that it leads to better ideas.

  • Mount chunks of the host filesystem (/usr/sbin, /lib, ...) into kube-proxy's VFS, and make it chroot() to that quasi-host-fs when executing iptables commands. That way it's always using exactly the binary present on the host. Introduces obvious complexity, as well as a bunch of security risks if an attacker gets code execution in the kube-proxy container.
  • Using iptables 1.8 in the container, probe both iptables and iptables-legacy for the presence of rules installed by the host. Hopefully, there will be rules in only one of the two, and that can tell kube-proxy which one to use. This is subject to race conditions, and is fragile to host mutations that happen after kube-proxy startup (e.g. apt-get upgrade that upgrades iptables and restarts the docker daemon, shifting its rules over to nf_tables). Can solve it with periodic reconciling (i.e. "oops, host seems to have switched to nf_tables, wipe all ip_tables rules and reinstall them in nf_tables!")
  • Punt the problem up to kubeadm and an entry in the KubeProxyConfiguration cluster object. IOW, just document that "it's your responsibility to correctly tell kube-proxy which version of iptables you're using, or things will break." Relies on humans to get things right, which I predict will cause a rash of broken clusters. If we do this, we should absolutely also wire something into node-problem-detector that fires when both ip_tables and nf_tables have rules programmed.
  • Have a cutover release in which kube-proxy starts using nf_tables exclusively, through the nft tools, and mandate that host OSes for k8s must do everything in nf_tables, no ip_tables allowed. Likely intractable given the variety of addons and non-k8s software that does stuff to the firewall (same reason iptables has endured all these years even though nftables is measurably better in every way).
  • Find some kernel hackers and ask them if there's any way to make ip_tables and nf_tables play nicer together, so that userspace can just continue tolerating mismatches indefinitely. I'm assuming this is ~impossible, otherwise they'd have done it already to facilitate the transition to nf_tables.
  • Create a new DaemonSet whose sole purpose is to be an RPC-to-iptables translator, and get all iptables-using pods in k8s to use it instead of talking direct to the kernel. Clunky, expensive, and doesn't solve the problem of host software touching stuff.
  • Just document (via a Sonobuoy conformance test) that this is a big bag of knives, and kick the can over to cluster operators to figure out how to safely upgrade k8s in place given these constraints. I can at least speak on behalf of GKE and say that I sure hope it doesn't come to that, because all our options are strictly worse. I can also speak as the author of MetalLB and say that the support load from people with broken on-prem installs will be completely unsustainable for me :)

Of all of these, I think "probe with both binaries and try to conform to whatever is already there" is the most tractable if kube-proxy were the only problem pod... But given the ecosystem of CNI addons and other third-party things, I foresee never ending duels of controllers flapping between ip_tables and nf_tables endlessly, all trying to vaguely converge on a single stack, but never succeeding.

When using iptables 1.8.2 in nf_tables mode, ipset (my version: v6.38) is still used by kube-proxy. But in nft, ipset functionality is built in.

It seems to work anyway, but I can't understand how; maybe my testing is insufficient.

I will try to test more thoroughly and make sure the ipsets are actually matched, not just defined, so that my tests don't merely happen to work.

But if anyone can explain the relation between iptables in nf_tables mode and ipset, please give a reference to some doc.

Ipset is only used in proxy-mode=ipvs. I get hits on ipset rules, so they work in some way:

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 KUBE-MARK-MASQ  all  --  *      *      !11.0.0.0/16          0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst /* Kubernetes service cluster ip + port for masquerade purpose */
   23  1380 KUBE-MARK-MASQ  all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst /* Kubernetes service external ip + port for masquerade and filter purpose */
   23  1380 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst PHYSDEV match ! --physdev-is-in ADDRTYPE match src-type !LOCAL /* Kubernetes service external ip + port for masquerade and filter purpose */

When using nf_tables mode, rules are added indefinitely to the KUBE-FIREWALL chain:

Chain KUBE-FIREWALL (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x8000/0x8000 /* kubernetes firewall for dropping marked packets */
....

in both proxy-mode ipvs and iptables.

Vonor commented

I experienced the same issue in #72370. As a workaround I found this in the oracle docs, which made the pods be able to communicate with each other as well as with the outside world again.

I discussed iptables/nft incompatibility in #62720 too, although that was before the iptables binary got rewritten...

It seems like for right now, the backward-compatible answer is "you have to make sure the host is using iptables in legacy mode".

FWIW, I hit this issue as well when deploying Kubernetes on Debian Buster. I've included some logging in #75418.

This works for me:

update-alternatives --set iptables /usr/sbin/iptables-legacy

@dcbw You said you have some iptables-experts. Can we get advice on a canonical way to probe the system for the preferred binary to run? If we have that, we can code kube-proxy and kubelet to be smarter

dcbw commented

@thockin simple answer is to run the host's iptables binaries. The decision whether to use iptables-legacy or iptables-nft depends solely on the binaries you are running. Everything we are talking about is running in the host's network namespace and already has to cooperate with other things running in the host's network namespace, especially in areas touching the firewall.

dcbw commented

I read a bit more of the bug here. Is the core problem that the IPVS proxy specifically is using ipset, which has some specific integration with iptables, and that this causes problems when the host has iptables-nft as its iptables binaries?

My iptables/nftables people say that ipset + iptables-nft should work correctly...

What iptables-nft/iptables-legacy versions are people using here that have the problem?

dcbw commented

@thockin this issue has way too many things in it. The first comment's bug (can't initialize iptables table `nat': Table does not exist (do you need to insmod?)) I believe was fixed some months ago already in iptables upstream.

The bug is "with iptables >= 1.8, the host and all containers have to use the same iptables mode or things won't work". Other comments may or may not be relevant to that.

@thockin simple answer is to run the host's iptables binaries.

Yeah, but how simple is that? It's not just a matter of mounting /usr/sbin/iptables into your container, because that's a symlink to something else. Is just mounting in all of /usr/sbin enough, or do you need to mount in /lib64 (or whatever) as well to guarantee you'll get the right shared libraries? Does it ever look in /etc/sysconfig? Etc. What exactly are we doing in OCP?

dcbw commented

@danwinship the symlink itself will actually tell you what the binary is. Following the 'iptables' symlink will either:

  1. be a direct symlink to iptables-legacy or iptables-nft
case $(readlink /sbin/iptables) in
xtables-legacy-multi|iptables-legacy)
      echo "legacy"
      ;;
xtables-nft-multi|iptables-nft)
      echo "nft"
      ;;
esac
  2. on systems that use 'alternatives' (because hey Linux is all about CHOICE! right???) we can call alternatives to tell us (but I'm not sure exactly what alternatives does underneath):
case $(alternatives --list | awk '/^iptables /{print $3}') in
*iptables-legacy|*xtables-legacy-multi)
      echo "legacy"
      ;;
*iptables-nft|*xtables-nft-multi)
      echo "nft"
      ;;
esac

Even that's a bit more complicated. Possibly the best choice here is to simply accept that if a container wants to modify the host OS then it may need to run tools provided by the host OS and not blindly assume that stuff it ships internally can always be used. Mount the host bin/lib/etc into /host and then have your internal /usr/sbin/iptables be a chroot wrapper into those dirs so that when kube-proxy calls iptables it actually runs the wrapper and does the right thing.
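
For illustration, a minimal sketch of such a wrapper, assuming the host's root filesystem is mounted at /host inside the pod (the path and mount are assumptions, not something kube-proxy does today):

#!/bin/sh
# installed as /usr/sbin/iptables inside the image (hypothetical layout);
# chrooting into /host means the host's own iptables binary and libraries run
exec chroot /host /usr/sbin/iptables "$@"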

I remain unconvinced that containers that wish to modify the host OS can just blindly go about whatever they want to do and assume the host OS doesn't matter.

dcbw commented

Another option I'm checking is whether the container could call 'nft list ' to see if iptables-nft had already run before. Not sure yet, but might be accurate enough to let tools the container bundles decide.

I remain unconvinced that containers that wish to modify the host OS can just blindly go about whatever they want to do and assume the host OS doesn't matter.

I wasn't trying to argue for that. I was just saying that "use the host iptables binary" is a very underspecified answer, and we need to provide more detail than that, or we're going to keep fielding duplicates of this bug forever.

(I guess in the general case you have no idea what the underlying distro is so you really can't assume anything beyond "iptables itself is in either /sbin or /usr/sbin". So probably the answer is "Mount the node's / onto /host in the container and then chroot into that to run iptables"?)

dcbw commented

I remain unconvinced that containers that wish to modify the host OS can just blindly go about whatever they want to do and assume the host OS doesn't matter.

I wasn't trying to argue for that. I was just saying that "use the host iptables binary" is a very underspecified answer, and we need to provide more detail than that, or we're going to keep fielding duplicates of this bug forever.

(I guess in the general case you have no idea what the underlying distro is so you really can't assume anything beyond "iptables itself is in either /sbin or /usr/sbin". So probably the answer is "Mount the node's / onto /host in the container and then chroot into that to run iptables"?)

Yeah, basically. Which is what we're doing in OpenShift. Probably need to mount /lib too because the iptables binaries link to libxtables and libip[4|6]tc and libpcap.

dcbw commented

So it was suggested by iptables developers that 'nft list ruleset' could be run by the thing wishing to know whether iptables-legacy or iptables-nft was being used. If that returns anything, it indicates that nftables has been initialized and is being used and thus iptables-nft should be used.

If not, that indicates that either (a) iptables-legacy is in-use or (b) nothing has created any netfilter tables yet so the system is in uninitialized state. (b) is pretty unlikely as usually during system bootup things add firewall rules before any containers are run.

Just wanted to post another confirmation here that when installing / running a K8s cluster on Raspberry Pis with Raspbian 10 / Buster, I had to run:

update-alternatives --set iptables /usr/sbin/iptables-legacy

Otherwise I was getting lots of errors with networking from various non-k8s-core pods (e.g. coredns, ingress, kube-proxy, kube-apiserver were seemingly fine, but flannel, metrics-server, nfs-client-provisioner were crashlooping).

I ran the above command on each node and rebooted all nodes, and everything quickly switched to Running status.

I can confirm that updating the host to use iptables-legacy works on Raspbian 10 (arm) and Debian 10 (amd64) to resolve the iptables mismatch issue.

For completeness it may be beneficial to update all of the network tools to use the legacy versions to avoid issues. These commands may or may not be pertinent depending upon specific host configuration but will avoid mixing legacy and nft modes if invoked from outside docker/kubernetes.

update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
update-alternatives --set arptables /usr/sbin/arptables-legacy
update-alternatives --set ebtables /usr/sbin/ebtables-legacy
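
After switching, the reported backend should change accordingly (illustrative output):

# iptables -V
iptables v1.8.2 (legacy)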

Have we decided on an approach here? While pinning iptables to program ip_tables is good, it sounds like if we don't win the race to update alternatives, and something else programs an nf_tables rule, we end up with rules in both ip_tables and nf_tables, and packets disappear (per #71305 (comment) ).

But on the other hand, if we can ensure that we don't mix ip_tables and nf_tables, do we now have to run all tests with both nf_tables and ip_tables as the backing implementation? Is nf_tables something we feel confident behaves identically?

We could do the same thing we did with swap: refuse to start if we detect any nf_tables usage, because while it's possible to use correctly, all signs point to no.

Should we then expedite a move to something like eBPF that gets us above the fray of the "table wars"?

Our approach in OpenShift is to have the relevant pods mount the entire host filesystem, and in the corresponding image we install wrapper scripts in /usr/sbin that chroot to the host filesystem and exec the copy of iptables there.

These images then work on any system regardless of whether it has old or new iptables, and in the latter case, whether that iptables is configured to use "legacy" or "nft" mode. (In particular, these images work on both RHEL 7, using legacy iptables, and RHEL 8, using new iptables in nft mode.)

But I'm thinking we should get iptables upstream to add a new "figure it out" mode to the client binaries, which would internally do something along the lines of what @dcbw suggested above to figure out if the system iptables was using nft mode or legacy mode, and then it would just use the same mode. Then we just tell everyone "make sure your containers are using the iptables-for-containers package from iptables version 1.8.whatever or later" and they don't have to worry beyond that.

@danwinship my concern is the validation though - should we now run every e2e once with nf_tables and once with ip_tables? (Ignoring more efficient testing strategies)

I like your suggestion of having iptables do something more sensible, but even if we could guarantee that we only use nf_tables, I don't believe that's what we've built & tested for.

The two modes are supposed to be equivalent in terms of behavior. (The advantage of using iptables in nft mode is that it lets other parts of the system use nft directly, and get nft's advantages, and their rules will interoperate correctly with the iptables-nft rules. Whereas if you use iptables-legacy, the iptables rules and nft rules would conflict with each other in complicated ways.)

So anyway, the two modes are supposed to be equivalent, so we shouldn't have to test against both modes, and if we did, and something in kubernetes didn't work right in one mode, that would indicate an iptables or kernel bug, not a kubernetes bug. It's possible we might end up wanting to add workarounds to kubernetes for a bug in one or the other mode at some point, if someone discovers such a bug, but I don't think we need to be testing against both modes continuously.

(And in practice, OCP on RHEL 8 using nft mode works just fine, other than possibly one problem with a -j REJECT mysteriously not actually rejecting and behaving like it was -j DROP.)

Thanks @danwinship - good to know it does seem to work on RHEL8!

My view is that we should actively choose ebpf vs nft vs ipvs vs ipt, and if someone wants to change that they have to chop the wood and carry the water - i.e. if the iptables maintainers do the work to replace iptables in our containers, and are generally available to fix breakages, then we should consider adopting nft.

In the absence of that, my view is that we should make a deliberate decision. If that means we're not yet ready to support nft, we could simply refuse to run when the OS configuration is incompatible, like we did with swap.

It's possible we might end up wanting to add workarounds to kubernetes for a bug in one or the other mode at some point, if someone discovers such a bug, but I don't think we need to be testing against both modes continuously.

That's the approach we've taken with Calico. We found an iptables-nft bug and worked around. The iptables developers were responsive when we reported the bug and, as it happens, they'd already committed a fix for the issue we hit and that fix was released into debian within a few weeks.

We haven't decided what we're doing wrt to detecting nft mode. So far, we just require the user to select it with a flag.

Some issues with mounting in the host filesystem:

  • people will get annoyed (we've gotten flak for mounting in the kernel modules directory so we can load iptables extensions, for example)
  • the host may not have iptables or it may be installed without optional packages (debian split iptables and iptables v6 IIRC and they probably put all the kernel modules in individual packages because they like to be fine-grained)

Telling people to switch to legacy mode won't work for long. Soon, nftables itself will be in active use and legacy mode won't be an option for a lot of folks.

A built-in "figure it out" mode seems right. This is frankly ridiculous. This is what APIs are for, and like it or not exec iptables is an API. Forcing all parties to coordinate and use the same binaries is ridonculous and clearly not workable.

It sounds like openshift has implemented a "figure it out" mode on its own. But I know many customers who are not going to be happy hostPath mounting / into kube-proxy. Do we REALLY need the host's binaries or can we install iptables 1.8+ and call our own iptables.sh which does the same detection?

What distros are known to have 1.8 available so I can do some playing?

@thockin Debian Buster is the main one (as Debian is used as the default distro by many of the k8s components), Ubuntu 19.04, RHEL 8 (and the upcoming Centos 8 by extension), Alpine 3.10, Fedora >= 29.

Ah, I too stumbled on Debian 10 (or buster) for a couple of hours, thinking I had messed up my networking.

update-alternatives --set iptables /usr/sbin/iptables-legacy

The above command did not help in my case; installation through RKE.

FATA[0000] [Failed to start [rke-etcd-port-listener] container on host [192.168.0.70]: Error response from daemon: driver failed programming external connectivity on endpoint rke-etcd-port-listener (670697328b8a06af1fd882ff0ec1555e660e419c51396542b663782f2ede9ece):  (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 2380 -j DNAT --to-destination 172.17.0.2:1337 ! -i docker0: iptables: No chain/target/match by that name.
 (exit status 1))] 

Everything started working after installing Debian 9.9.0 :)

A built-in "figure it out" mode seems right. This is frankly ridiculous. This is what APIs are for, and like it or not exec iptables is an API. Forcing all parties to coordinate and use the same binaries is ridonculous and clearly not workable.

I think perhaps the core oversight from iptables folks was not thinking that people would want to alter the host netns's firewall both from the host system and simultaneously from within a chroot whose software does not match the host. Outside the container/k8s ecosystem, that would seem like a daft thing to do, so it doesn't surprise me that we're in this mess now :(

FWIW, empirically I'm seeing a significant uptick in support requests for MetalLB that end up being caused by this mismatch, so if nothing else, it would be worth calling out as a gotcha in the k8s installation documentation (along with the update-alternatives workaround for Debian-based systems).

@thockin to your question about the hostPath mounting: unless further changes happen upstream, it'd be fine to install iptables 1.8+ in the container and judiciously invoke the correct variant, rather than mount the host's copy. The only remaining risk there is if bugfixes are only applied to one of the two, and a bugfix involves changing how the firewall is programmed at the netlink layer. Then you could end up with the two iptables fighting each other.

AIUI, there's no obvious way to tell from the kernel which mode is in use on a freshly-booted system. Both subsystems (ipt and nft) have existed for years now, so that's not sufficient. The "signal" is the destination of the symlinks (yes, symlinks - I shudder at the thought of them diverging).

We could write a wrapper that uses whichever subsystem already has rules present, but that is terrifyingly racy and definitely violates the least-surprise principle. Then, if there are no rules present, pick one.

Coming off of the discussion about SCTP, I wonder if the right way for distros to "signal" which subsystem they wish to use is by module blacklisting.

Separately, we've been able to be very lazy around multiple iptables versions, and that party is certainly over. I don't know whether there are any guarantees between different versions of the nft-ipt-compat logic. Can older versions correctly work with the ruleset generated by newer ones? What happens when kube-proxy uses the host's iptables-nft binary, and an Istio container uses an older (debian) one?

What if (ugh) kubelet provided a mini-firewalld?

AIUI, there's no obvious way to tell from the kernel which mode is in use on a freshly-booted system.

That is correct, but fortunately we don't have to handle "freshly-booted" systems, we only have to handle systems where kubelet is already running. Kubelet will create iptables rules of its own with the system iptables binaries (we don't support systemd-containerized kubelet any more) before any containers are running, so containers can just say "if there are any rules in nft then use iptables-nft, else use iptables-legacy".

Coming off of the discussion about SCTP, I wonder if the right way for distros to "signal" which subsystem they wish to use is by module blacklisting.

That's not a terrible idea but it still requires having both sets of binaries in every container and having a wrapper to figure out which one to use... so it's not that much better than the "see which subsystem has rules" idea.

skitt commented
  • the host may not have iptables or it may be installed without optional packages (debian split iptables and iptables v6 IIRC and they probably put all the kernel modules in individual packages because they like to be fine-grained)

Your main point stands, but the Debian (and Ubuntu, and ...) packages aren’t fine-grained: iptables contains all the (arp|ep|ip|ip6|x)tables tools, and the kernel package contains all the modules. There are package splits in the iptables source, but they split out library packages, not tool packages, and the iptables package depends on them all anyway.

dims commented

@danwinship @thockin Looks like it would be good to document any workarounds for 1.16 release notes since folks are hitting this already? (see #82361 for example)

Perhaps Kubernetes could bind mount a static iptables "proxy" binary into containers (at all popular locations) that proxies to the host's iptables.

Update: To clarify, per @thockin's OOB request, I'm not proposing bind mounting a directory or even any of the host's files/binaries.

I'm proposing we write a new static binary (let's call it iptables-proxy-client) in, say, Go, and then whenever we start a container we bind mount iptables-proxy-client at all possible paths that iptables and friends are commonly seen at in various distros/containers.

When invoked, iptables-proxy-client forwards its arguments and environment (or necessary subset) off to a bind-mounted Unix socket or link local service running on the host, communicating with a new host-side iptablesd daemon/service that executes the iptables/etc command on the host (using the host's preferred kernel mechanism) and replies with the stdout/stderr/exit code back to the client.

Because it'd be a static binary, we don't need to deal with spraying potentially-conflicting shared libraries from the host all over the containers. We'd shadow just the iptables binaries, for all containers that have one of those binaries at one of those paths.

fortunately we don't have to handle "freshly-booted" systems, we only have to handle systems
where kubelet is already running. Kubelet will create iptables rules of its own with the system
iptables binaries (we don't support systemd-containerized kubelet any more) before any containers
are running, so containers can just say "if there are any rules in nft then use iptables-nft, else use
iptables-legacy".

This sounds very feasible. I started digging a bit deeper.

First bump: our kube-proxy image builds on debian-iptables, which builds on debian-base, which is stretch. We could a) convert everything to buster as debian-base v2.0.0; b) start publishing debian version-named base images; c) one-off this one. @tallclair thoughts? The scope of such a change is unclear.

We'd want something in-path replacing iptables, iptables-save, and iptables-restore.

We'd want logic akin to:

BASE_CMD="iptables-legacy"
if [ -n "$(nft list tables ip)" ]; then
    BASE_CMD="iptables-nft"
fi

Does that seem right? I tested iptables-nft and observed that it creates nftables in family ip.
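
For reference, this is roughly what that looks like (illustrative output; the exact set of tables depends on which chains have been touched):

# nft list tables ip
table ip nat
table ip filter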

@justaugustus for release viz.

dims commented

also cc @lachie83

Trying to do a naive debian-base update is somewhat fraught.

Step 7/11 : RUN echo "Yes, do as I say!" | apt-get purge     bash     e2fsprogs     libcap2-bin     libmount1     libsmartcols1     libudev1     libblkid1     libss2     libsystemd0     ncurses-base     ncurses-bin     tzdata
 ---> Running in 7fd1177f5e10
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 apt : Depends: libapt-pkg5.0 (>= 1.7.0~alpha3~) but it is not going to be installed
       Recommends: ca-certificates but it is not going to be installed
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
The command '/bin/sh -c echo "Yes, do as I say!" | apt-get purge     bash     e2fsprogs     libcap2-bin     libmount1     libsmartcols1     libudev1     libblkid1     libss2     libsystemd0     ncurses-base     ncurses-bin     tzdata' returned a non-zero code: 100
make[1]: *** [Makefile:78: build] Error 100
make[1]: Leaving directory '/usr/local/google/home/thockin/src/k/kubernetes/build/debian-base'
make: *** [Makefile:56: sub-build-amd64] Error 2

Removing libslang2 and libsystemd0 makes that go away. libprocps6 claims to not exist. A bunch of others claim to not be installed. Streamlining that list gets me a build of 52.4MB.

@tallclair there's apt magic happening that I don't follow, and your fingerprints are on it :)

Now an error installing nftables:

dpkg (subprocess): unable to execute new nftables package pre-installation script (/var/lib/dpkg/tmp.ci/preinst): No such file or directory
dpkg: error processing archive /tmp/apt-dpkg-install-4N16Ls/5-nftables_0.9.0-2_amd64.deb (--unpack):
 new nftables package pre-installation script subprocess returned error exit status 2
dpkg (subprocess): unable to execute new nftables package post-removal script (/var/lib/dpkg/tmp.ci/postrm): No such file or directory
dpkg: error while cleaning up:
 new nftables package post-removal script subprocess returned error exit status 2
Errors were encountered while processing:
 /tmp/apt-dpkg-install-4N16Ls/5-nftables_0.9.0-2_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

It looks like the nftables package wants bash.

We'd want logic akin to:

BASE_CMD="iptables-legacy"
if [ -n "$(nft list tables ip)" ]; then
    BASE_CMD="iptables-nft"
fi

Does that seem right?

It's not entirely reliable. The default tables will be created any time you run iptables-nft, so if the user is on an iptables-legacy system but runs a stray pod that erroneously invokes iptables-nft, then that would create those tables and then cause your wrapper to do the wrong thing until reboot.

We eventually decided that you can't really do any better than:

    legacy_lines=$(iptables-legacy-save 2>/dev/null | wc -l || echo 0)
    nft_lines=$(iptables-nft-save 2>/dev/null | wc -l || echo 0)
    if [ "${legacy_lines}" -lt "${nft_lines}" ]; then
        mode=nft
    else
        mode=legacy
    fi

which is terrible, but it's unlikely to be fooled. You'd definitely want to only do this once and then cache the answer.
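
One way to do that caching step, sketched under the assumption that the image lays the binaries out Debian-style as xtables-legacy-multi / xtables-nft-multi (an assumption, not what the kube-proxy image ships today):

    # once $mode has been decided as above, point the iptables names at the
    # chosen backend so the detection never has to run again
    for cmd in iptables iptables-save iptables-restore; do
        ln -sf "/usr/sbin/xtables-${mode}-multi" "/usr/sbin/${cmd}"
    done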

I had started working on this at https://github.com/danwinship/iptables-wrappers/ but the scripts and README there are all totally inconsistent with each other because I kept changing my mind about what the best way to make this work was...

If I push a branch, can you start integrating with that, maybe PR your changes into my branch? @bradfitz has another idea he will expand on in a bit, but I want to see how bad this rabbit hole gets.

NB: We'd need every single container that uses iptables to participate in this...

We'd also need some EOL plan - when can we stop doing this?

Updated my comment above: #71305 (comment)

https://github.com/thockin/kubernetes/tree/iptables-nft is a WIP on building base images. We'd need to hook into the Dockerfile to take over the bare iptables commands

@bradfitz

When invoked, iptables-proxy-client forwards its arguments and environment (or necessary subset)
off to a bind-mounted Unix socket or link local service running on the host, communicating with a
new host-side iptablesd daemon/service that executes the iptables/etc command on the host (using
the host's preferred kernel mechanism) and replies with the stdout/stderr/exit code back to the
client.

We discussed a bit offline, but for the record, some thoughts:

We would have to do this on every hostNetwork: true && privileged: true pod (also maybe consider net capabilities). That doesn't seem SO bad - it's a small set. Do we have to handle RuntimeClass that isn't cgroups-based? Probably not going to intersect with hostNetwork && privileged.

I don't think we need to handle non-privileged pods (or without the caps we care about) - they should not be able to use iptables anyway, but we should triple check that.

What about things that run iptables inside their own netns (e.g. istio's capture)? @danwinship do they need to use the same backend, or can it be mixed mode at that scope? @louiscryan FYI

We would have to run this new iptablesd daemonset on every node. Need telemetry, provisioning, etc. That should be a wash wrt actual memory consumption, but only if we can reclaim some from kube-proxy and/or calico and/or ...

This plays very badly with PodSecurityPolicy unless we trap in WAY below it (at CRI) which is a much harder fix. Doing it as admission is more transparent, but possibly subject to ordering bugs.

👋 Friendly ping from 1.16 release lead. I wanted to let you know that we are planning to cut 1.16.0-rc.1 tomorrow and go into code-thaw. Please let me know if this fix needs to be considered as 1.16 release blocking.

NB: We'd need every single container that uses iptables to participate in this...

Every single container that uses iptables in the root network namespace. It's fine for, eg, istio, to use whatever iptables mode it wants in the pod namespace. (Though if you have multiple sidecars in a pod they all need to use the same mode...)

We'd also need some EOL plan - when can we stop doing this?

Probably as long as we care about people running Kubernetes on RHEL/CentOS 7. (People will probably be running RHEL 7 longer than people are running CentOS 7, but we might care about those users less. Either way, by the time we stop caring about that, everyone else should be using nft mode.)

I don't think we need to handle non-privileged pods (or without the caps we care about) - they should not be able to use iptables anyway, but we should triple check that.

That is correct. Pods need to be hostNetwork and either privileged or CAP_NET_ADMIN for them to matter.

The only thing we can do in the near term is tell people to use legacy mode.

Even 1.8.2 (as present in debian-buster) is broken. #82361

How best to document this?

I'm proposing we write a new static binary (let's call it iptables-proxy-client) in, say, Go, and then whenever we start a container we bind mount iptables-proxy-client at all possible paths that iptables and friends are commonly seen at in various distros/containers.

Do we do anything even remotely similar to this currently? (The "overwriting binaries in other people's containers without telling them" part, not the proxying part.)

It's nice in that it solves the problem for everyone all at once but... not nice in every other way 🙂

Random idea halfway between the two current approaches: add a new volume type "hostBinaries" or "kubernetesHelpers" or something, and if you mount a volume of that type into your pod, you'll find that it contains iptables binaries that do the right thing via unspecified means. (And in the future, maybe also contains other binaries to solve similar host/pod interaction issues? Kind of a more powerful downward API sort of thing.)

I agree it's kind of awful, but so is the problem...

We'd STILL need to run that daemonset, which is a significant change.

We should be able to give containers a working set of iptables-legacy or iptables-nft binaries directly rather than needing a proxy. Just give them an entire chroot rather than just the binaries. (ie, build a Debian container image containing only the iptables package and the packages it depends on (eg, glibc), and then mount that somewhere in the pod). Then instead of overwriting their /usr/sbin/iptables with a proxy binary, you overwrite it with a shell script that does chroot /iptables-binary-volume-sadkjf -- iptables "$@", etc. Or that works with the hostBinaries volume idea too; the volume would just contain the chroot within it in addition to the wrapper scripts.

It's nice in that it solves the problem for everyone all at once but... not nice in every other way.

Agreed. I proposed it because it'd fix everything all at once, rather than waiting for all network add-ons to update to work either the old way (just exec iptables) or some new way (find some new downward API directory to chroot into). How long would that take?

We'd STILL need to run that daemonset, which is a significant change.

Put it in the kubelet? Too gross? It's already in the business of calling iptables, no?

Reminder (as this discussion shakes out): if you're calling iptables, there is a nonzero chance you will need to load a module. So, it has become part of the lore that best-practice iptables-callers already always need to bind-mount /lib/modules from the host.

If we're going down the route of half-magic bind-mounts, then I think I'd rather see the kubelet assemble it out of bind-mounts from the host, rather than needing a specific container.

if you're calling iptables, there is a nonzero chance you will need to load a module.

No, there isn't. Kubelet will always have created iptables rules before starting any pods.

I'd rather see the kubelet assemble it out of bind-mounts from the host, rather than needing a specific container.

I thought about that, but the reason we didn't try to do that before is that there are no safe assumptions you can make about what the distro-installed version of iptables does and doesn't need from the host filesystem. (eg, there's no a priori reason to think that /etc/alternatives would be needed, since the iptables source code itself does not refer to any such thing.) If you want to use the system iptables you have to mount the entire filesystem.

if you're calling iptables, there is a nonzero chance you will need to load a module.

No, there isn't. Kubelet will always have created iptables rules before starting any pods.

Each match and action type has its own module, loaded on demand. So if you do -m set then you'll trigger loading of the xt_set module. If you do -j DNAT, then you'll load xt_DNAT and so on.

There are two different kind of kernel module loading here: the kind that Casey was worrying about is that if you run /usr/sbin/iptables and it finds that the ip_tables (or ip6_tables) module is not loaded, then it will explicitly invoke modprobe to load it, and that only works if modprobe and ip_tables.ko are available.

For the thing you're talking about, the iptables binary isn't what loads the module. When you send a rule to the kernel using -m set, the kernel netfilter code will decide that it needs to have the xt_set module, and so it will send a request to userspace, and that request gets received and handled by udevd in the root net/pid/mount/etc namespace, regardless of where the original iptables call came from. So the container doesn't need access to the modules in that case.

@danwinship @thockin Looks like it would be good to document any workarounds for 1.16 release notes since folks are hitting this already? (see #82361 for example)

I've created kubernetes/website#16271 to add documentation to the Installing kubeadm page. There is some discussion in that PR (comment) on whether we should add this information in other places too.

Adding to release notes is already on the agenda: #81930 (comment)

For future readers not able to make kube-proxy work for some reason, you might want to look for a replacement: https://github.com/cloudnativelabs/kube-router (it does the job plus some other stuff; please take a look at the documentation before doing anything).

Kind of a creepy idea, but you could use nsenter to run the iptables command on the host, in the host's environment.
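
A hedged example of that approach, run from a privileged pod with hostPID (target PID 1 is assumed to be the host's init, so its mount and network namespaces are the host's):

nsenter --target 1 --mount --net -- iptables -t nat -L KUBE-SERVICES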

I've started to play with CentOS 8, it comes with iptables v1.8.2 (nf_tables) but without iptables (legacy)
https://access.redhat.com/solutions/4377321
Haven't found how OpenShift 4 works yet

aojea commented

I've started to play with CentOS 8, it comes with iptables v1.8.2 (nf_tables) but without iptables (legacy)
https://access.redhat.com/solutions/4377321
Haven't found how OpenShift 4 works yet

The iptables user space tools are provided in a container #71305 (comment)

The iptables 1.8.2 in RHEL/CentOS 8 has the necessary bugfixes from 1.8.3 backported

The status of this issue is that it was resolved by #82966, if I read correctly, and therefore Kubernetes 1.17 (kube-proxy in 1.17) should work without having to switch the nodes to iptables-legacy?

The official kubernetes packages, and in particular kubeadm-based installs, are fixed as of 1.17. Other distributions of kubernetes may have been fixed earlier or might not be fixed yet.

did we wind up backporting this at all?

no... we'd have to backport the whole rebasing-the-images-to-debian-buster thing which seems like a big change for a point release

I experienced the same issue in #72370. As a workaround I found this in the oracle docs, which made the pods be able to communicate with each other as well as with the outside world again.

Updated link:
https://docs.oracle.com/en/operating-systems/oracle-linux/kubernetes/kube_admin_config.html#kube_admin_config_iptables

To get rid of that libvirt error, my permanent workaround on Debian 11 (as a host) with the libvirtd daemon is to block the loading of iptables-related modules:

Create a file in /etc/modprobe.d/nft-only.conf:


#  Source: https://www.gaelanlloyd.com/blog/migrating-debian-buster-from-iptables-to-nftables/
#
blacklist x_tables
blacklist iptable_nat
blacklist iptable_raw
blacklist iptable_mangle
blacklist iptable_filter
blacklist ip_tables
blacklist ipt_MASQUERADE
blacklist ip6table_nat
blacklist ip6table_raw
blacklist ip6table_mangle
blacklist ip6table_filter
blacklist ip6_tables

libvirtd daemon now starts without any error.

Post-analysis: apparently I had the ip_tables module loaded alongside many nft-related modules; once it was gone, the pesky error message went away.
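
If it helps anyone else, an easy way to confirm the blacklist took effect after a reboot (illustrative check):

lsmod | grep -E 'ip_tables|ip6_tables|iptable_|ip6table_'   # should print nothing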