Performance issues when adding subnets to firewall instead of interface
Not sure if this should be considered a netavark firewall issue, a kernel issue, or just performance limitations.
I have a server (Ryzen 5600X, 1 gigabit ethernet port with 4 VLANs on top of it) that is acting as my home router as well as running a bunch of containers. I have 47 containers running, across 11 networks (resulting in 11 network interfaces getting created). Each of those networks has a private IPv4 subnet, a public IPv6 subnet, and a ULA (private) IPv6 subnet assigned. I'm using firewalld as my backend (which itself is using nftables as its backend), with other non-container-related rules configured for general connectivity and firewalling. I also have CAKE SQM set up in both the ingress and egress directions, with the bandwidth set to 1Gb/s. The SQM for the ingress direction is set up by redirecting incoming packets to an IFB interface; this is done by adding a tc filter rule.
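(For reference, the general shape of that SQM setup is something like the following; the interface names and CAKE options shown here are illustrative rather than my exact commands.)
# Egress shaping directly on the physical interface:
$ tc qdisc replace dev eth0 root cake bandwidth 1gbit
# Ingress shaping: redirect incoming packets to an IFB device and shape there.
$ ip link add ifb0 type ifb
$ ip link set ifb0 up
$ tc qdisc add dev eth0 handle ffff: ingress
$ tc filter add dev eth0 parent ffff: matchall action mirred egress redirect dev ifb0
$ tc qdisc replace dev ifb0 root cake bandwidth 1gbit ingress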
Recently, I did a test with iperf3 between this server and two devices connected via ethernet. On both devices, TCP traffic from the server to the device is sent at around 920Mb/s-970Mb/s (basically the full line rate), but TCP traffic from the device to the server maxes out at ~600Mb/s, with occasional drops to 200Mb/s. During this time, I can see that ksoftirqd is running at 100% on the server, suggesting a CPU bottleneck. (When traffic is sent from the server to the device, ksoftirqd is not at 100% on either side.)
Here, 192.168.3.1 is the server that is acting as the router and running all of the containers.
$ iperf3 -c 192.168.3.1 -t 60
Connecting to host 192.168.3.1, port 5201
[ 5] local 192.168.3.21 port 35770 connected to 192.168.3.1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 73.8 MBytes 618 Mbits/sec 0 296 KBytes
[ 5] 1.00-2.00 sec 73.5 MBytes 617 Mbits/sec 0 250 KBytes
[ 5] 2.00-3.00 sec 73.0 MBytes 612 Mbits/sec 0 227 KBytes
[ 5] 3.00-4.00 sec 72.6 MBytes 609 Mbits/sec 0 168 KBytes
[ 5] 4.00-5.00 sec 69.5 MBytes 583 Mbits/sec 0 285 KBytes
[ 5] 5.00-6.00 sec 72.6 MBytes 609 Mbits/sec 0 250 KBytes
[ 5] 6.00-7.00 sec 72.1 MBytes 605 Mbits/sec 0 227 KBytes
[ 5] 7.00-8.00 sec 70.4 MBytes 590 Mbits/sec 0 168 KBytes
[ 5] 8.00-9.00 sec 72.1 MBytes 605 Mbits/sec 0 174 KBytes
[ 5] 9.00-10.00 sec 71.2 MBytes 598 Mbits/sec 0 238 KBytes
[ 5] 10.00-11.00 sec 72.6 MBytes 609 Mbits/sec 0 122 KBytes
[ 5] 11.00-12.00 sec 72.1 MBytes 605 Mbits/sec 0 221 KBytes
[ 5] 12.00-13.00 sec 72.6 MBytes 609 Mbits/sec 0 168 KBytes
[ 5] 13.00-14.00 sec 71.8 MBytes 602 Mbits/sec 0 180 KBytes
[ 5] 14.00-15.00 sec 68.4 MBytes 574 Mbits/sec 0 378 KBytes
[ 5] 15.00-16.00 sec 71.2 MBytes 598 Mbits/sec 0 116 KBytes
[ 5] 16.00-17.00 sec 71.2 MBytes 598 Mbits/sec 0 250 KBytes
[ 5] 17.00-18.00 sec 72.6 MBytes 609 Mbits/sec 0 168 KBytes
[ 5] 18.00-19.00 sec 72.1 MBytes 605 Mbits/sec 0 331 KBytes
[ 5] 19.00-20.00 sec 73.0 MBytes 612 Mbits/sec 0 261 KBytes
[ 5] 20.00-21.00 sec 72.6 MBytes 609 Mbits/sec 0 180 KBytes
[ 5] 21.00-22.00 sec 70.4 MBytes 590 Mbits/sec 0 267 KBytes
[ 5] 22.00-23.00 sec 72.1 MBytes 605 Mbits/sec 0 174 KBytes
[ 5] 23.00-24.00 sec 72.2 MBytes 606 Mbits/sec 0 221 KBytes
[ 5] 24.00-25.00 sec 69.6 MBytes 584 Mbits/sec 0 349 KBytes
[ 5] 25.00-26.00 sec 71.2 MBytes 598 Mbits/sec 0 180 KBytes
[ 5] 26.00-27.00 sec 71.2 MBytes 598 Mbits/sec 0 238 KBytes
...
$ iperf3 -c 192.168.3.1 --bidir
Connecting to host 192.168.3.1, port 5201
[ 5] local 192.168.3.21 port 37654 connected to 192.168.3.1 port 5201
[ 7] local 192.168.3.21 port 37670 connected to 192.168.3.1 port 5201
[ ID][Role] Interval Transfer Bitrate Retr Cwnd
[ 5][TX-C] 0.00-1.00 sec 75.5 MBytes 633 Mbits/sec 0 459 KBytes
[ 7][RX-C] 0.00-1.00 sec 90.5 MBytes 759 Mbits/sec
[ 5][TX-C] 1.00-2.00 sec 76.5 MBytes 642 Mbits/sec 0 279 KBytes
[ 7][RX-C] 1.00-2.00 sec 89.9 MBytes 754 Mbits/sec
[ 5][TX-C] 2.00-3.00 sec 78.1 MBytes 655 Mbits/sec 0 378 KBytes
[ 7][RX-C] 2.00-3.00 sec 106 MBytes 891 Mbits/sec
[ 5][TX-C] 3.00-4.00 sec 78.4 MBytes 657 Mbits/sec 0 349 KBytes
[ 7][RX-C] 3.00-4.00 sec 89.0 MBytes 747 Mbits/sec
[ 5][TX-C] 4.00-5.00 sec 77.2 MBytes 648 Mbits/sec 0 192 KBytes
[ 7][RX-C] 4.00-5.00 sec 108 MBytes 909 Mbits/sec
[ 5][TX-C] 5.00-6.00 sec 79.4 MBytes 666 Mbits/sec 0 279 KBytes
[ 7][RX-C] 5.00-6.00 sec 80.2 MBytes 673 Mbits/sec
[ 5][TX-C] 6.00-7.00 sec 74.9 MBytes 628 Mbits/sec 0 325 KBytes
[ 7][RX-C] 6.00-7.00 sec 97.5 MBytes 818 Mbits/sec
[ 5][TX-C] 7.00-8.00 sec 65.1 MBytes 546 Mbits/sec 0 296 KBytes
[ 7][RX-C] 7.00-8.00 sec 94.4 MBytes 792 Mbits/sec
[ 5][TX-C] 8.00-9.00 sec 72.2 MBytes 606 Mbits/sec 0 267 KBytes
[ 7][RX-C] 8.00-9.00 sec 99.2 MBytes 833 Mbits/sec
[ 5][TX-C] 9.00-10.00 sec 78.1 MBytes 655 Mbits/sec 0 465 KBytes
[ 7][RX-C] 9.00-10.00 sec 101 MBytes 846 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval Transfer Bitrate Retr
[ 5][TX-C] 0.00-10.00 sec 756 MBytes 634 Mbits/sec 0 sender
[ 5][TX-C] 0.00-10.00 sec 754 MBytes 633 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 957 MBytes 803 Mbits/sec 0 sender
[ 7][RX-C] 0.00-10.00 sec 956 MBytes 802 Mbits/sec receiver
$ iperf3 -c 192.168.3.1 --reverse -t 60
Connecting to host 192.168.3.1, port 5201
Reverse mode, remote host 192.168.3.1 is sending
[ 5] local 192.168.3.21 port 38648 connected to 192.168.3.1 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 117 MBytes 982 Mbits/sec
[ 5] 1.00-2.00 sec 117 MBytes 984 Mbits/sec
[ 5] 2.00-3.00 sec 117 MBytes 985 Mbits/sec
[ 5] 3.00-4.00 sec 116 MBytes 976 Mbits/sec
[ 5] 4.00-5.00 sec 117 MBytes 985 Mbits/sec
[ 5] 5.00-6.00 sec 117 MBytes 984 Mbits/sec
[ 5] 6.00-7.00 sec 117 MBytes 983 Mbits/sec
^C[ 5] 7.00-7.33 sec 38.2 MBytes 983 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-7.33 sec 0.00 Bytes 0.00 bits/sec sender
[ 5] 0.00-7.33 sec 858 MBytes 982 Mbits/sec receiver
iperf3: interrupt - the client has terminated
I started by disabling CAKE SQM, and that restored the full line rate, with a barely noticeable CPU usage increase (not counting the CPU usage of iperf3 itself). That initially led me to think something in the SQM setup was causing a slowdown in the kernel (maybe the redirection of packets to the ifb interface that gets created?). However, I couldn't find any reports of this online.
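(Disabling it for the test was just a matter of deleting the qdiscs, roughly like this, again with illustrative interface names.)
$ tc qdisc del dev eth0 root
$ tc qdisc del dev eth0 ingress
$ tc qdisc del dev ifb0 root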
I then set up the same CAKE SQM on the device (192.168.3.21) to see if the problem was reproducible there, but iperf3 was able to send and receive at full line rate, so whatever the cause was, it was specific to the server.
I used perf top to see if anything stood out, and the top consumer was nftables. Comparing it with perf top on the device, I didn't see nftables at the top of the list there, which suggests it was related to the firewall. On a hunch, I stopped all of the pods and containers, and iperf3 was able to achieve line rate in both directions.
I looked at the rules generated, and the netavark_zone zone has each subnet of each network added to it, rather than each network's interface. Because of this, there are 3x as many entries (each network contributes three subnets instead of one interface). Additionally, because I have some firewall policies between netavark_zone and the host (and other zones), additional rules are generated on top of that.
I modified the zone definition to have the interfaces instead of the subnets, and after making this change, iperf3 can achieve line rate, with no noticeable CPU increase.
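For anyone wanting to reproduce this, the change boils down to replacing the zone's sources with the bridge interfaces; roughly something like the following (the subnet values and interface name here are made up, and there would be one --remove-source per subnet of each network):
$ firewall-cmd --zone=netavark_zone --list-sources
$ firewall-cmd --permanent --zone=netavark_zone --remove-source=10.89.0.0/24
$ firewall-cmd --permanent --zone=netavark_zone --remove-source=fd00:1234:5678::/64
$ firewall-cmd --permanent --zone=netavark_zone --add-interface=podman1
$ firewall-cmd --reload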
This suggests one or more of the following:
- Nftables is inefficient for some reason when parsing/matching IPv4 or IPv6 addresses.
- Firewalld is generating rules for nftables inefficiently and is not taking advantage of built-in features (such as sets).
- The number of rules that need to be generated because of the use of IPv4/IPv6 addresses instead of interfaces is so much greater that it slows down the firewall processing code.
I'm not sure about 1, but I'd like to think that this is fairly well optimized in the kernel. For 2, while writing this up, I checked firewalld's issue tracker to see if there's anything there, and found firewalld/firewalld#1399, so this is somewhat of a known issue there. For 3, I'm thinking that if there's no specific reason for using subnets instead of interfaces in the firewall rules, netavark could filter by interface instead of subnet, which would prevent the rule explosion when a network has both IPv4 and IPv6 subnets assigned.
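To sketch what suggestion 3 would mean at the rule level (this is purely illustrative and not netavark's actual rule layout; the table, chain, subnet, and interface names are made up), a network with an IPv4 subnet, a public IPv6 subnet, and a ULA subnet needs three source matches today, but only one interface match:
# per-subnet matching: one rule per subnet per network
# (assumes an existing "inet fw" table with a "forward" chain; names are made up)
$ nft add rule inet fw forward ip saddr 10.89.0.0/24 accept
$ nft add rule inet fw forward ip6 saddr 2001:db8:1::/64 accept
$ nft add rule inet fw forward ip6 saddr fd00:1234:5678::/64 accept
# interface matching: one rule per network, regardless of address families
$ nft add rule inet fw forward iifname "podman1" accept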
Can suggestion 3 be looked into?
Are you using netavark with the firewalld driver? This driver is not recommended as it is super buggy, see the other issues in this repo.
Regardless, both our own iptables and nftables drivers also match on subnets, so I would assume that if subnets are a problem, they face the same issue.
For 3, I'm thinking that if there's no specific reason for using subnets instead of interfaces in the firewall rules, netavark could filter by interface instead of subnet, which would prevent the rule explosion when a network has both IPv4 and IPv6 subnets assigned.
I did suggest using interfaces a while ago to @mheon, but we decided that using subnets was better (as it is what we have been doing historically) and there weren't really strong reasons for interfaces. If there are strong performance numbers, we can certainly reconsider.
The problem is that changing the rule layout is most likely a breaking change, as we would still need to remove/add rules when the old layout exists without causing conflicts.
Are you using netavark with the firewalld driver? This driver is not recommended as it is super buggy, see the other issues in this repo.
Ah, I see, I guess I got lucky in that it works well enough for my use case (and for the issue where DNS isn't allowed by default, I can easily add a permanent rule to fix that). I'm not sure I can safely use the iptables or nftables drivers, since there are other rules I want in place, and I want to make sure there's no conflict/overlap between netavark's rules and my rules (including an easy way of modifying the firewall rules at runtime).
Regardless, both our own iptables and nftables drivers also match on subnets, so I would assume that if subnets are a problem, they face the same issue.
I wonder if it's primarily a firewalld issue because of the nature of the rules that are generated, in that there are a ton of jump targets that don't do anything, and sets are not being used at all. It's entirely possible the iptables and nftables drivers would hit this issue too, but maybe only at a much higher scale.
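As a rough illustration of the sets point (hypothetical rules; the table, chain, set name, and subnets are made up), the individual per-subnet rules could be collapsed into a single lookup against a named set:
# assumes an existing "inet fw" table with a "forward" chain (names are made up)
$ nft add set inet fw netavark_subnets '{ type ipv4_addr; flags interval; }'
$ nft add element inet fw netavark_subnets '{ 10.89.0.0/24, 10.89.1.0/24, 10.89.2.0/24 }'
$ nft add rule inet fw forward ip saddr @netavark_subnets accept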
The problem is that changing the rule layout is most likely a breaking change, as we would still need to remove/add rules when the old layout exists without causing conflicts.
Not entirely sure what you mean by this; what exactly would be the breaking change?
Not entirely sure what you mean by this; what exactly would be the breaking change?
netavark speaks rule layout 1 today; now you update netavark to rule layout 2. It no longer understands rule layout 1 unless we add expensive rule-migration workarounds, and as such it likely either errors out if there is anything incompatible or doesn't remove the old subnet-based rules, leaving the system in a bad state. Of course a reboot would fix these things easily, but not everybody does that.
Ah, understood, migration would be the tricky part here.