kubernetes/kubernetes

Proposal: nftables backend for kube-proxy

nevola opened this issue · 66 comments

/kind feature
/sig network

Development of nftables (the next generation of iptables) is almost complete. It covers all of the iptables capabilities, but it also provides a much more flexible language that makes it possible to build a very complete, higher-performance load balancer with very little extension of the existing infrastructure.
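For example (a minimal, hypothetical sketch, not the rules nftlb actually emits; the table name, addresses and ports are made up), a single nftables rule can already spread new connections across several backends using numgen and an anonymous map. The Go program below simply pipes such a ruleset to the nft binary, which is assumed to be installed and recent enough to support these expressions:

```go
package main

import (
	"log"
	"os/exec"
	"strings"
)

// Hypothetical ruleset: one DNAT rule spreads TCP :80 traffic for a virtual
// IP across two backends with numgen + an anonymous map. All names and
// addresses are illustrative only.
const ruleset = `
table ip demo_lb {
	chain prerouting {
		type nat hook prerouting priority -100; policy accept;
		ip daddr 10.0.0.10 tcp dport 80 dnat to numgen random mod 2 map { 0 : 192.168.1.10, 1 : 192.168.1.11 }
	}
	chain postrouting {
		type nat hook postrouting priority 100; policy accept;
		ip daddr { 192.168.1.10, 192.168.1.11 } masquerade
	}
}
`

func main() {
	// Apply the whole ruleset atomically via "nft -f -".
	cmd := exec.Command("nft", "-f", "-")
	cmd.Stdin = strings.NewReader(ruleset)
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("nft failed: %v\n%s", err, out)
	}
}
```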

For that reason, we've created a small-footprint daemon named nftlb that optimizes and simplifies rule generation, and which is fully compatible with additional nftables firewalling rules.

The current abilities are:

  • Topologies supported: Destination NAT, Source NAT and Direct Server Return. This enables the use of the load balancer in one-armed and two-armed network architectures.
  • Support for both IPv4 and IPv6 families.
  • Multilayer load balancing: DSR at layer 2, protocol-agnostic IP-based load balancing at layer 3, and UDP, TCP and SCTP load balancing at layer 4.
  • Multiport support for ranges and lists of ports.
  • Multiple virtual services (or farms) support.
  • Schedulers available: weight, round robin, hash and symmetric hash.
  • Priority support per backend.
  • Live management of virtual services and backends programmatically through a JSON API (see the sketch after this list).
  • Web service authentication with a security key.
  • Automated testbed included.
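To make the JSON API item above a bit more concrete, here is a rough Go sketch of pushing a virtual service to a daemon like nftlb. The endpoint path, port, header and field names are illustrative guesses rather than the documented nftlb schema; the real API is described in the nftlb repository:

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

// Illustrative types only; the real nftlb JSON schema may differ.
type Backend struct {
	Name   string `json:"name"`
	IPAddr string `json:"ip-addr"`
	Weight int    `json:"weight"`
}

type Farm struct {
	Name        string    `json:"name"`
	VirtualAddr string    `json:"virtual-addr"`
	VirtualPort string    `json:"virtual-ports"`
	Mode        string    `json:"mode"`      // e.g. "snat", "dnat", "dsr"
	Scheduler   string    `json:"scheduler"` // e.g. "weight", "rr", "hash"
	Backends    []Backend `json:"backends"`
}

func main() {
	farm := Farm{
		Name:        "web-vip",
		VirtualAddr: "10.0.0.10",
		VirtualPort: "80",
		Mode:        "snat",
		Scheduler:   "rr",
		Backends: []Backend{
			{Name: "bck0", IPAddr: "192.168.1.10", Weight: 1},
			{Name: "bck1", IPAddr: "192.168.1.11", Weight: 1},
		},
	}
	body, err := json.Marshal(map[string][]Farm{"farms": {farm}})
	if err != nil {
		log.Fatal(err)
	}

	// Endpoint, port and auth header below are placeholders.
	req, err := http.NewRequest(http.MethodPost, "http://127.0.0.1:5555/farms", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Key", "changeme") // web-service authentication key
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("API responded:", resp.Status)
}
```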

This approach also solves some known corner cases with the iptables and IPVS backends, and even brings performance improvements. More info here:
https://www.zevenet.com/knowledge-base/nftlb/what-is-nftlb/

Main repository can be found here:
https://github.com/zevenet/nftlb

It'd be great to know what the possibilities are for integrating an nftables backend into kube-proxy, and whether there are any other use cases not yet covered by the nftables approach.

Thanks.

This approach also solves some known corner cases with the iptables and IPVS backends

I am curious to learn what corner cases.

and even brings performance improvements.

Any test data?

I think there are two separate issues:

  1. Do we want an nft-based kube-proxy in-tree? I think we were kind of hoping that IPVS was going to save the day and we wouldn't need another replacement (especially so soon), although the comments about IPVS in the blog post (especially the 10x performance claims) are interesting. (And then there's the port range problem (#23864).)

  2. Do we want to support the possibility of nft-based systems at all? Because as I understand it, even if we reject this code for kubernetes itself but want to allow users to run an out-of-tree third-party nft-based kube-proxy, that still creates problems for kubernetes, because it is effectively required that either all components on the system use iptables, or all components on the system use nft. (Eg, if a network plugin wants to ensure that traffic coming off its internal bridge doesn't get eaten by the firewall, then it needs to know whether the firewall is implemented with iptables rules or nft rules, because adding the exception to the wrong system won't work.)

    • Kubernetes still uses iptables rules for a few things other than kube-proxy (eg, HostPort), and presumably all of that would have to be switched over to use nft. (This isn't a lot of work, but would probably end up requiring duplicate iptables/nft codepaths, and some way for kubelet to know which one to use; see the detection sketch after this list.)
    • CNI plugins that create their own iptables rules (and that want to support the nft kube-proxy) would probably need to have separate iptables and nft codepaths too. And some way to figure out which one to use.
    • Pods that use hostNetwork and modify iptables rules are also an issue, but that one has to be Somebody Else's Problem. ("If you have pods that do iptables stuff, you can't switch to the nft kube-proxy until you update them.")
    • I think at this point, docker is probably not an issue: if you are using a CNI plugin that doesn't use the docker bridge, then docker's own use of iptables is irrelevant and harmless. (It might add rules that do nothing, but it shouldn't end up adding any rules that would hurt anything.)
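As a concrete illustration of "some way to figure out which one to use": recent iptables builds report which backend they drive, so `iptables --version` prints `(nf_tables)` for the iptables-nft compatibility shim and `(legacy)` for the legacy binary. A minimal Go sketch of that heuristic (this is only a sketch, not what kubelet actually does, and it assumes the iptables binary is on PATH):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// detectIptablesBackend reports whether the local iptables binary is the
// legacy implementation or the nf_tables-based compatibility shim.
func detectIptablesBackend() (string, error) {
	out, err := exec.Command("iptables", "--version").CombinedOutput()
	if err != nil {
		return "", err
	}
	switch {
	case strings.Contains(string(out), "nf_tables"):
		return "nft", nil
	case strings.Contains(string(out), "legacy"):
		return "legacy", nil
	default:
		return "unknown", nil
	}
}

func main() {
	backend, err := detectIptablesBackend()
	if err != nil {
		fmt.Println("could not detect iptables backend:", err)
		return
	}
	fmt.Println("iptables backend:", backend)
}
```

Any real solution would presumably live in a shared utility so that kubelet, kube-proxy and CNI plugins all agree on the answer.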

I think the answer to question 2 (do we want to support nft-based systems at all) is "yes". Or at least, if it's not "yes" now, it will be eventually. iptables is considered a technological dead end in the kernel and there's more and more work going into improving nft and less and less going into iptables.

Are you actually using nftlb with kubernetes already? Have you run into nft-vs-iptables problems? (Or are you only using it for cluster-ingress routing, not for pod-to-pod routing?)

@danwinship

(And then there's the port range problem (#23864).)

I proposed fwmark + IPVS to implement port ranges in kubernetes/community#1738, please check the IPVS section.

Any comment?

Thank you for your response @danwinship

  1. Do we want an nft-based kube-proxy in-tree?
  2. Do we want to support the possibility of nft-based systems at all?

Those are interesting questions. From my point of view, nft will be adopted sooner or later, and all major distributions are currently working on the integration.

The proposal to use nft in kube-proxy is just an example, because I consider that the added value in terms of usability and performance is really worth it.

But, as you said, the full picture is bigger than that, and the current networking functions would need to be extended to support both. For the list you provided, we can study the "translations" to be done and identify whether there is anything missing in the current nft infrastructure that needs to be added.

Although we already provided some numbers during our research (here is where we discovered the 10x improvement in DSR mode), we're performing further benchmarks comparing iptables LB, IPVS and nftlb across several scenarios. After that, we'll work on the first PoC of kubernetes with nft, which hopefully will be ready by September.

Or are you only using it for cluster-ingress routing, not for pod-to-pod routing?

Both use cases are worth addressing.

Thank you for your response @m1093782566

Exactly, there are some workarounds, but the main idea of this proposal is to discuss whether integrating the nft infrastructure is something to take into account in the roadmap.

Here's my feelings:

  • I am open to an nft impl in kube-proxy (and EBPF, maybe) if we can keep it isolated, and if we can keep the feature set consistent. Doing it out of tree is also viable, but if we want to push on that we should do a better job of writing a spec for services impls.

  • We really need to modularize kube-proxy better :)

  • the FWMARK hack for IPVS ranges is a hack and I really dislike using it

  • This adds MORE fuel to the fire for making the net plugin and services plugin more closely coupled

Hi, during the netfilter workshop last week we presented some benchmarks that could shed some light on the nftables approach for kube-proxy.

To summarize, we're getting more than 50% better performance than iptables in NAT cases (with 3 backends), and the gap grows as more backends are added, due to the constant complexity of the nftables rules design.

We also presented the penalties caused by the Spectre/Meltdown mitigations: about 40% for iptables but only 17% for nftables, tested with the same NAT cases with conntrack enabled.

Extended info here:
https://www.zevenet.com/knowledge-base/nftlb/nftlb-benchmarks-and-performance-keys/
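For intuition on why the rule count can stay roughly constant: nftables lets a single base-chain rule dispatch through a verdict map keyed on (VIP, port), with each service chain spreading traffic via numgen, so adding services or backends grows map elements rather than the chain a packet traverses. A generic sketch of that structure (not nftlb's actual output; all names and addresses are hypothetical), applied from Go via the nft binary:

```go
package main

import (
	"log"
	"os/exec"
	"strings"
)

// One base-chain rule consults a verdict map keyed on (VIP . port); each
// service gets a small chain that spreads traffic with numgen. Adding a
// service or a backend only grows the maps, so the packet path stays the
// same length.
const ruleset = `
table ip demo_lb {
	chain svc_web {
		dnat to numgen random mod 2 map { 0 : 192.168.1.10, 1 : 192.168.1.11 }
	}
	map svc_vmap {
		type ipv4_addr . inet_service : verdict
		elements = { 10.0.0.10 . 80 : jump svc_web }
	}
	chain prerouting {
		type nat hook prerouting priority -100; policy accept;
		ip daddr . tcp dport vmap @svc_vmap
	}
}
`

func main() {
	cmd := exec.Command("nft", "-f", "-")
	cmd.Stdin = strings.NewReader(ruleset)
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("nft failed: %v\n%s", err, out)
	}
}
```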

On the other hand, nftables and eBPF shouldn't be incompatible. There is some work in progress towards this integration: https://www.spinics.net/lists/netfilter-devel/msg53891.html

So, @nevola how do you want to proceed? Here's my proposal.

  1. between you, the IPVS folks, and the iptables folks (myself and others) we write a doc that covers all of the things you need to implement to be a viable Service implementation (node ports, external IPs, etc).

  2. you write a new kube-proxy module

  3. we let users decide which mode they want.

Hi @thockin, that sounds good. Although I'm not an expert Go developer, I accept the challenge.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

We're still working on this issue.

Hi, we've started a prototype named kube-nftlb, thanks to the collaboration of @AquoDev

https://github.com/zevenet/kube-nftlb

Any feedback and guidance to continue with the integration will be appreciated.
Thanks!

@nevola I'm very interested in the DSR story (just technically / future proofing, we haven't hit a performance point where the distinction would be important to us today). What encap are you using to forward to nonlocal pods? How are you telling k8s that the pod is allowed to send from the destination address etc?

FWIW IPVS supports DSR too, so though I'm totally pro nftables, the DSR feature shouldn't really be tied to the nftables <-> IPVS choice (and in fact IPVS should be switching to nftables at some point too, right?)

Hi @rbtcollins, the first milestone of this project is to integrate nftlb with the same features that kube-proxy provides. Currently, we're implementing a Go client, separate from the nftlb daemon, to manage the "translation" layer between k8s and the nftlb API.

Forwarding to non-local pods has not been tackled yet, but if you have any concerns in that regard, please let us know so we can take them into consideration.

nftables and IPVS have different implementations of DSR, but IPVS relies on the netfilter hooks for features like multiport or tproxy. Once the migration to nftables is done, it shouldn't affect IPVS.
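For reference, here is a rough idea of what a layer-2 DSR rule can look like in nft's netdev family, as far as I understand it: at ingress, traffic for the VIP gets its destination MAC rewritten to a backend's MAC and is forwarded out of the egress interface, so the backend answers the client directly. This is only a sketch; the interface names, MACs and addresses are placeholders, not nftlb's actual output.

```go
package main

import (
	"log"
	"os/exec"
	"strings"
)

// Hypothetical layer-2 DSR ruleset in the netdev family: rewrite the
// destination MAC for VIP traffic and forward it out another interface.
const ruleset = `
table netdev demo_dsr {
	chain ingress {
		type filter hook ingress device "eth0" priority 0; policy accept;
		ip daddr 10.0.0.10 tcp dport 80 ether daddr set 52:54:00:aa:bb:01 fwd to "eth1"
	}
}
`

func main() {
	cmd := exec.Command("nft", "-f", "-")
	cmd.Stdin = strings.NewReader(ruleset)
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("nft failed: %v\n%s", err, out)
	}
}
```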

Hi everybody, we've recently released nftlb 0.3 with the following new features:

  • Stateless NAT support from ingress
  • Automated DSR configuration from layer 3
  • Flow mark per service and per backend
  • Logging support per virtual service
  • L7 helpers support
  • Support of custom source IP instead of masquerading

Currently, kube-nftlb is able to manage the minimal options needed to create a working, scalable service, but now we're in the phase of making the kubernetes & docker rules coexist with the nftlb ones. This is possible on a host with the iptables-nftables compatibility layer.

We had hoped that kube-proxy was the only component managing rules, but that is not the case. So we're thinking about adding some intelligence to nftlb so it knows which rules are already active and then inserts its own rules accordingly (see the sketch below).
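A minimal sketch of that kind of detection, assuming the nft binary with JSON support is installed (the parsing below is deliberately loose; the schema is documented in libnftables-json(5)):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// "nft -j list ruleset" prints the whole ruleset as JSON.
	out, err := exec.Command("nft", "-j", "list", "ruleset").Output()
	if err != nil {
		log.Fatalf("nft failed: %v", err)
	}

	// Decode loosely rather than committing to the full schema.
	var ruleset struct {
		Nftables []map[string]json.RawMessage `json:"nftables"`
	}
	if err := json.Unmarshal(out, &ruleset); err != nil {
		log.Fatalf("decode: %v", err)
	}

	// Print every table that already exists, so a manager like nftlb could
	// decide where to insert its own rules without clobbering anyone else's.
	for _, obj := range ruleset.Nftables {
		raw, ok := obj["table"]
		if !ok {
			continue
		}
		var t struct {
			Family string `json:"family"`
			Name   string `json:"name"`
		}
		if err := json.Unmarshal(raw, &t); err == nil {
			fmt.Printf("existing table: %s %s\n", t.Family, t.Name)
		}
	}
}
```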

If you have any other ideas, they'll be appreciated,
Thanks!

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

kvaps commented

/remove-lifecycle stale

@nevola Has there been anymore progress on this?

Hi @cmluciano, sure, we're implementing more features in nftlb (the load balancer core), including per-service security policies, persistence and more, but more integration is still required on the kube-nftlb side in order to be fully compatible with the docker rules.

@nevola a while back in #62720 (comment) it was suggested that a doc be written; was that done, and if so can you please share it? thx

Hi @DanyC97, it wasn't arranged, but I'm available to follow up on the requirements.

In the year since I last looked at this, I have done some soul-searching. I don't think that we want to jam more proxy modes into kube-proxy, and I would ask that we consider how to do this as a stand-alone binary. We can refactor kube-proxy to make things more reusable if needed. We can move libs to more available repos if needed.

Hi Tim, currently nftlb is a binary that can be launched in daemon mode (providing a web service API), but also as a standalone executable that generates rules according to an input configuration (in JSON format). Of course, the netfilter/nftables stack needs to be installed, but I think it can be compiled statically, if that's what you mean.

AFAIU, the daemon should understand the kubernetes API in order to create the virtual services (that is what we were trying to do with kube-nftlb), or maybe kube-proxy could translate the kubernetes API to nftlb in either of the two modes, daemon or standalone.

@nevola I have implemented an nftables Go library which does not depend on any external command; it talks directly to netfilter for rule programming. I am also looking into building an nft-specific kube-proxy based on the library. At this point I am at the design phase; specifically, I am considering whether to use on-the-fly rule generation as done in the iptables kube-proxy, or to use nftables maps/vmaps as a more efficient way of updating rules. Please let me know if you are interested in syncing up.

Hi @sbezverk, sure, I'm aware of your work on the nftables library in Go. I think it's a good approach, but I have 2 main concerns:

  1. By talking directly to netlink you're skipping too much of the validation logic in the nftables user space stack. As you know, the nftables user space logic is currently built as a library (libnftables), so it can be compiled statically and used from any binary (for example, as nftlb does).
  2. nftables solves a lot of iptables design problems, so a direct translation would not help to improve or solve the current problems with kube-proxy.

@nevola Thank you for your reply. You are right, in my case I rely on the package's internal validation. It appears to work out well so far. One of the reasons I would not want to rely on libnftables is the additional dependency. For example, validation of transparent proxy was broken for some time, so we had to wait a good while for 0.9.1 before it got fixed and published as a package, and even now not all distros carry this fix. Talking directly to netfilter eliminates this type of issue.
I totally agree about point 2; the reason I was considering following the iptables model was just to get things working and be able to test e2e, and I don't mind refactoring later to use more advanced nftables features like maps or vmaps.
Anyway, if you see any way for us to collaborate please let me know, I would be happy to chat.

At this point I am at the design phase; specifically, I am considering whether to use on-the-fly rule generation as done in the iptables kube-proxy, or to use nftables maps/vmaps as a more efficient way of updating rules.

I don't know whether that's specifically a good idea or not, but the iptables kube-proxy is heavily optimized toward what works best with the iptables API, and you should definitely spend time thinking about other approaches that might work better with nft.
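For instance, one nft-native approach (a generic sketch only, assuming a table and dispatch verdict map like the ones sketched earlier in this thread; every name and address is hypothetical) is to never rewrite the base rule at all, and instead add or remove chains and map elements in a single transaction:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// run feeds a batch of nft commands to "nft -f -" so they apply atomically.
func run(batch string) error {
	cmd := exec.Command("nft", "-f", "-")
	cmd.Stdin = strings.NewReader(batch)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("nft: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Assumes a table "demo_lb" with a verdict map "svc_vmap" keyed on
	// VIP . port already exists; all names here are hypothetical.
	addService := `
add chain ip demo_lb svc_api
add rule ip demo_lb svc_api dnat to numgen random mod 2 map { 0 : 192.168.2.10, 1 : 192.168.2.11 }
add element ip demo_lb svc_vmap { 10.0.0.11 . 443 : jump svc_api }
`
	// The dispatch rule in the base chain is never rewritten; a new service is
	// just a new chain plus one map element, applied in one transaction.
	if err := run(addService); err != nil {
		log.Fatal(err)
	}
}
```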

in my case I rely on the package's internal validation

For example, how do you ensure that the kernel you're talking to supports a certain NFTA attribute? How do you ensure that your raw netlink payload will be valid in all cases?

For example, validation of transparent proxy was broken for some time, so we had to wait a good while for 0.9.1 before it got fixed and published as a package, and even now not all distros carry this fix.

You can always use the official git tree to avoid relying on packaging. Note that you'll always be relying on the nft design before your integration, so nft will always be ahead of your library.

Regarding the nft design, we've been working on it for some time, and currently we're able not only to completely substitute LVS, but also to provide extra features and solve some corner cases.

Please take a look at our testing suite to see how we generate optimized nft rules for load balancing. They're in the form JSON input -> NFT output, but our idea in the near future is KUBE API -> NFT output.

https://github.com/zevenet/nftlb/tree/master/tests

Please also take a look at the rough kube-nftlb prototype:

https://github.com/zevenet/kube-nftlb

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/remove-lifecycle stale

/reopen

Although this project was paused in order to fully implement nftlb, we're currently working on the integration.

@nevola: Reopened this issue.

In response to this:

/reopen

Although this project was paused in order to fully implement nftlb, we're currently working on the integration.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nevola Please reopen this, thanks.

/reopen

@nevola: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@nevola: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Updates to this project can be found here:

https://github.com/zevenet/kube-nftlb

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

@s3rj1k: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nevola Please reopen it again :)

/reopen

@nevola: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nevola the issue is closing because there's a lifecycle/rotten label, which should be removed by you (if you think it's a thing) with /remove-lifecycle rotten

About the implementation, may I suggest (as this is an old issue) bringing this subject to a kubernetes sig-network meeting? The point here is, I think it's faster and easier to gauge how the SIG receives the idea of an nft kube-proxy when everybody is in one meeting :)

I've seen some other proposals (can't remember if it was yours) in a past meeting, and as far as I remember folks weren't really willing to add another kube-proxy backend, but rather to decouple kube-proxy as much as possible and allow it to be sort of modular :)

But my memory is bad, so I might be wrong. Anyway, as this is a rotten issue with no one assigned, I propose you bring this subject up for a quick standup explanation in the SIG meeting and then decide whether it should move forward as a kubernetes issue or a separate project, or whether more folks are interested in helping you with this :)

Thank you!

Oh, now I remember: in the meeting notes @sbezverk presented the nftproxy proposal, and as I remember it moved forward as a side project, right @sbezverk?

So, trying to make this work, do you think this is still something we need to pursue inside the kube-proxy code?

@rikatz I wish it had, but since the last discussion not much has happened. I have not gotten any clear instructions on the next steps to get nfproxy accepted as a side project in the kubernetes org. I would still like to do it if the community has any interest in an nftables-based kube-proxy backend.

Hi @rikatz

@nevola the issue is closing because there's a lifecycle/rotten label, which should be removed by you (if you think it's a thing) with /remove-lifecycle rotten

Will do, thank you. The project is currently under rapid evolution, so I don't think it's time to let it rot :)

About the implementation, may I suggest (as this is an old issue) bringing this subject to a kubernetes sig-network meeting? The point here is, I think it's faster and easier to gauge how the SIG receives the idea of an nft kube-proxy when everybody is in one meeting :)

Sure, but I think we need something real to show, at least to prove that iptables and IPVS can be fully replaced by nft. The iptables replacement is now achieved, and we're now working on being able to replace IPVS.

I've seen some other proposals (can't remember if it was yours) in a past meeting, and as far as I remember folks weren't really willing to add another kube-proxy backend, but rather to decouple kube-proxy as much as possible and allow it to be sort of modular :)

For sure, we're open to discussing the architecture. kube-nftlb is just a prototype to prove functionality and performance. We're also gathering some future requirements that the nftlb daemon is already able to handle but that are not yet integrated in k8s, which would be interesting to raise in the meeting.

But my memory is bad, so I might be wrong. Anyway, as this is a rotten issue with no one assigned, I propose you bring this subject up for a quick standup explanation in the SIG meeting and then decide whether it should move forward as a kubernetes issue or a separate project, or whether more folks are interested in helping you with this :)

Sure, we can arrange to participate in a SIG meeting and explain the project and the approach. Please let me know how to do that and whether I can invite people from my team.

Thank you!

/remove-lifecycle rotten

@nevola as per the https://docs.google.com/document/d/1_w77-zG_Xj0zYvEMfQZTQ-wPP4kXkpGD8smVtW_qqWM, you just need to put the subject in the next agenda, join the zoom meeting (next is 23/07 2pm PT) and wait until your time to speak arrives :D

@sbezverk My personal opinion (and this is strictly mine): as Kubernetes is extremely 'pluggable', I understand why the community is not tempted to turn this into a core kube-proxy feature, but it could at the same time be made 'documentable'. As an example, we have a lot of Ingress Controller implementations (although the nginx one was the first and the one adopted by the community), and the CNI lets you choose between Calico, Flannel, Cilium, etc.

I'm seeing the same happen with kube-proxy, with each CNI implementing its own kube-proxy replacement to deal with its cases. This happened with Cilium and it seems to be happening with Calico, so what I'm seeing here is that each CNI is taking its own kube-proxy path, while there's already IPVS and iptables. This is why I think (and again, this is my own opinion) maybe it's not worth implementing this as a new kube-proxy mode, but rather as some external binary, perhaps referenced in the docs, as we do when dealing with the CNI.

Still, I think this should be brought back to the meeting so we can, together, give a good direction on how to deal with this :)

Thank you!

@nevola thanks for the presentation in the meeting.

Now my question is: do you want to continue with this issue open, or move the discussion to the sig-network mailing list and then see if this can become a KEP or a feature proposal?

As a side comment, I got confused and thought your project and @sbezverk's were the same, really sorry! You both should have warned me earlier :)

Hi @rikatz, sure, we can close this issue. We'll probably continue the discussion via other channels.
Thank you!

@nevola @rikatz Is there a place we can follow for updates? I was also looking for some updates on this.

kubernetes/kube-proxy#23

Hi @jeffgtxjava, sure the project is going to be reactivated.

I created a fork here:
https://github.com/relianoid/kube-nftd

If anyone is interested in being part of the working team, please let me know.

Cheers.