containers/netavark

Performance issues when adding subnets to the firewall instead of interfaces


Not sure if this should be considered a netavark firewall issue, a kernel issue, or just performance limitations.

I have a server (Ryzen 5600X, 1 gigabit ethernet port with 4 VLANs on top of it) that is acting as my home router as well as running a bunch of containers. I have 47 containers running, across 11 networks (resulting in 11 network interfaces getting created). Each of those networks has a private IPv4 subnet, public IPv6 subnet, and a ULA (private) IPv6 subnet assigned. I'm using firewalld as my backend (which itself is using nftables as the backend), with other non-container-related rules configured for general connectivity and firewalling. I also have CAKE SQM set up in both the ingress and egress directions, with the bandwidth set to 1Gb/s. The SQM for the ingress direction is set up by redirecting packets that come in to an IFB interface; this is done by adding a tc filter rule.
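
For reference, the SQM setup is roughly the following (a sketch only; the interface name and the exact cake options are placeholders, not copied from my actual config):

# egress shaping: cake directly on the interface
$ tc qdisc add dev eth0 root cake bandwidth 1Gbit
# ingress shaping: redirect incoming packets to an ifb device and run cake there
$ ip link add ifb0 type ifb
$ ip link set ifb0 up
$ tc qdisc add dev eth0 handle ffff: ingress
$ tc filter add dev eth0 parent ffff: protocol all matchall action mirred egress redirect dev ifb0
$ tc qdisc add dev ifb0 root cake bandwidth 1Gbit ingress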

Recently, I did a test with iperf3 between this server and two devices connected via ethernet. On both devices, TCP traffic from the server to the device runs at around 920-970Mb/s (basically the full line rate), but TCP traffic from the device to the server tops out at ~600Mb/s, with occasional drops to 200Mb/s. During this time, I can see that ksoftirqd is running at 100% on the server, suggesting a CPU bottleneck. (When traffic is sent from the server to the device, ksoftirqd is not at 100% on either side.)

Here, 192.168.3.1 is the server which is running as a router and is running all of the containers.

$ iperf3 -c 192.168.3.1 -t 60
Connecting to host 192.168.3.1, port 5201
[  5] local 192.168.3.21 port 35770 connected to 192.168.3.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  73.8 MBytes   618 Mbits/sec    0    296 KBytes
[  5]   1.00-2.00   sec  73.5 MBytes   617 Mbits/sec    0    250 KBytes
[  5]   2.00-3.00   sec  73.0 MBytes   612 Mbits/sec    0    227 KBytes
[  5]   3.00-4.00   sec  72.6 MBytes   609 Mbits/sec    0    168 KBytes
[  5]   4.00-5.00   sec  69.5 MBytes   583 Mbits/sec    0    285 KBytes
[  5]   5.00-6.00   sec  72.6 MBytes   609 Mbits/sec    0    250 KBytes
[  5]   6.00-7.00   sec  72.1 MBytes   605 Mbits/sec    0    227 KBytes
[  5]   7.00-8.00   sec  70.4 MBytes   590 Mbits/sec    0    168 KBytes
[  5]   8.00-9.00   sec  72.1 MBytes   605 Mbits/sec    0    174 KBytes
[  5]   9.00-10.00  sec  71.2 MBytes   598 Mbits/sec    0    238 KBytes
[  5]  10.00-11.00  sec  72.6 MBytes   609 Mbits/sec    0    122 KBytes
[  5]  11.00-12.00  sec  72.1 MBytes   605 Mbits/sec    0    221 KBytes
[  5]  12.00-13.00  sec  72.6 MBytes   609 Mbits/sec    0    168 KBytes
[  5]  13.00-14.00  sec  71.8 MBytes   602 Mbits/sec    0    180 KBytes
[  5]  14.00-15.00  sec  68.4 MBytes   574 Mbits/sec    0    378 KBytes
[  5]  15.00-16.00  sec  71.2 MBytes   598 Mbits/sec    0    116 KBytes
[  5]  16.00-17.00  sec  71.2 MBytes   598 Mbits/sec    0    250 KBytes
[  5]  17.00-18.00  sec  72.6 MBytes   609 Mbits/sec    0    168 KBytes
[  5]  18.00-19.00  sec  72.1 MBytes   605 Mbits/sec    0    331 KBytes
[  5]  19.00-20.00  sec  73.0 MBytes   612 Mbits/sec    0    261 KBytes
[  5]  20.00-21.00  sec  72.6 MBytes   609 Mbits/sec    0    180 KBytes
[  5]  21.00-22.00  sec  70.4 MBytes   590 Mbits/sec    0    267 KBytes
[  5]  22.00-23.00  sec  72.1 MBytes   605 Mbits/sec    0    174 KBytes
[  5]  23.00-24.00  sec  72.2 MBytes   606 Mbits/sec    0    221 KBytes
[  5]  24.00-25.00  sec  69.6 MBytes   584 Mbits/sec    0    349 KBytes
[  5]  25.00-26.00  sec  71.2 MBytes   598 Mbits/sec    0    180 KBytes
[  5]  26.00-27.00  sec  71.2 MBytes   598 Mbits/sec    0    238 KBytes
...
$ iperf3 -c 192.168.3.1 --bidir
Connecting to host 192.168.3.1, port 5201
[  5] local 192.168.3.21 port 37654 connected to 192.168.3.1 port 5201
[  7] local 192.168.3.21 port 37670 connected to 192.168.3.1 port 5201
[ ID][Role] Interval           Transfer     Bitrate         Retr  Cwnd
[  5][TX-C]   0.00-1.00   sec  75.5 MBytes   633 Mbits/sec    0    459 KBytes
[  7][RX-C]   0.00-1.00   sec  90.5 MBytes   759 Mbits/sec
[  5][TX-C]   1.00-2.00   sec  76.5 MBytes   642 Mbits/sec    0    279 KBytes
[  7][RX-C]   1.00-2.00   sec  89.9 MBytes   754 Mbits/sec
[  5][TX-C]   2.00-3.00   sec  78.1 MBytes   655 Mbits/sec    0    378 KBytes
[  7][RX-C]   2.00-3.00   sec   106 MBytes   891 Mbits/sec
[  5][TX-C]   3.00-4.00   sec  78.4 MBytes   657 Mbits/sec    0    349 KBytes
[  7][RX-C]   3.00-4.00   sec  89.0 MBytes   747 Mbits/sec
[  5][TX-C]   4.00-5.00   sec  77.2 MBytes   648 Mbits/sec    0    192 KBytes
[  7][RX-C]   4.00-5.00   sec   108 MBytes   909 Mbits/sec
[  5][TX-C]   5.00-6.00   sec  79.4 MBytes   666 Mbits/sec    0    279 KBytes
[  7][RX-C]   5.00-6.00   sec  80.2 MBytes   673 Mbits/sec
[  5][TX-C]   6.00-7.00   sec  74.9 MBytes   628 Mbits/sec    0    325 KBytes
[  7][RX-C]   6.00-7.00   sec  97.5 MBytes   818 Mbits/sec
[  5][TX-C]   7.00-8.00   sec  65.1 MBytes   546 Mbits/sec    0    296 KBytes
[  7][RX-C]   7.00-8.00   sec  94.4 MBytes   792 Mbits/sec
[  5][TX-C]   8.00-9.00   sec  72.2 MBytes   606 Mbits/sec    0    267 KBytes
[  7][RX-C]   8.00-9.00   sec  99.2 MBytes   833 Mbits/sec
[  5][TX-C]   9.00-10.00  sec  78.1 MBytes   655 Mbits/sec    0    465 KBytes
[  7][RX-C]   9.00-10.00  sec   101 MBytes   846 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate         Retr
[  5][TX-C]   0.00-10.00  sec   756 MBytes   634 Mbits/sec    0             sender
[  5][TX-C]   0.00-10.00  sec   754 MBytes   633 Mbits/sec                  receiver
[  7][RX-C]   0.00-10.00  sec   957 MBytes   803 Mbits/sec    0             sender
[  7][RX-C]   0.00-10.00  sec   956 MBytes   802 Mbits/sec                  receiver
$ iperf3 -c 192.168.3.1 --reverse -t 60
Connecting to host 192.168.3.1, port 5201
Reverse mode, remote host 192.168.3.1 is sending
[  5] local 192.168.3.21 port 38648 connected to 192.168.3.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   117 MBytes   982 Mbits/sec
[  5]   1.00-2.00   sec   117 MBytes   984 Mbits/sec
[  5]   2.00-3.00   sec   117 MBytes   985 Mbits/sec
[  5]   3.00-4.00   sec   116 MBytes   976 Mbits/sec
[  5]   4.00-5.00   sec   117 MBytes   985 Mbits/sec
[  5]   5.00-6.00   sec   117 MBytes   984 Mbits/sec
[  5]   6.00-7.00   sec   117 MBytes   983 Mbits/sec
^C[  5]   7.00-7.33   sec  38.2 MBytes   983 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-7.33   sec  0.00 Bytes  0.00 bits/sec                  sender
[  5]   0.00-7.33   sec   858 MBytes   982 Mbits/sec                  receiver
iperf3: interrupt - the client has terminated

I started by disabling CAKE SQM, which restored the full line rate, with barely any noticeable increase in CPU usage (not counting the CPU usage of iperf3 itself). That initially led me to think something in the SQM setup was causing a slowdown in the kernel (maybe the redirection of packets to the ifb interface that gets created?). However, I couldn't find any reports of this online.

I then set up the same CAKE SQM on the device (192.168.3.21) to see if the problem was reproducible there, but iperf3 was able to send and receive at full line rate, so whatever the cause was, it was something specific to the server.

I used perf top to see if anything stood out, and the top consumer was nftables. Comparing with perf top on the device, nftables was nowhere near the top of the list there, which suggested the slowdown was related to the firewall. On a hunch, I stopped all of the pods and containers, and iperf3 was then able to achieve line rate in both directions.

I looked at the generated rules, and netavark_zone adds each subnet of each network as a source, rather than adding each network's interface to the zone. Because every network has three subnets (IPv4, public IPv6, and ULA IPv6), this results in 3x as many entries. Additionally, because I have some firewall policies between netavark_zone and the host (and other zones), some additional rules are generated on top of that.
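
For illustration, each network ends up contributing three source entries to the zone, roughly the equivalent of the following (the subnets here are placeholders, not my real ones; netavark doesn't literally run firewall-cmd, but the effect on the zone is the same):

$ firewall-cmd --zone=netavark_zone --add-source=10.89.1.0/24
$ firewall-cmd --zone=netavark_zone --add-source=fd01:aaaa::/64
$ firewall-cmd --zone=netavark_zone --add-source=2001:db8:1::/64

With 11 networks, that's 33 source entries in the zone.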

I modified the zone definition to use the interfaces instead of the subnets, and after making this change, iperf3 can achieve line rate with no noticeable CPU increase.
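
Concretely, the change amounted to replacing the zone's sources with the bridge interfaces, something like the following (interface and subnet names are examples; the actual bridge names depend on the network configuration):

$ firewall-cmd --zone=netavark_zone --remove-source=10.89.1.0/24   # repeated for every subnet
$ firewall-cmd --zone=netavark_zone --add-interface=podman1        # one entry per network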

This suggests one or more of the following:

  1. Nftables is inefficient for some reason when parsing/matching IPv4 or IPv6 addresses.
  2. Firewalld is generating rules for nftables inefficiently and is not taking advantage of built-in features such as sets (see the sketch after this list).
  3. The number of rules that need to be generated because of the use of IPv4/IPv6 addresses instead of interfaces is so much greater that it slows down the firewall processing code.
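
As a rough sketch of what I mean by sets in point 2 (a standalone example with made-up table/chain/set names, not what firewalld actually generates): all of the subnets can be matched by a single rule against one set, instead of one rule per subnet:

$ nft add table inet demo
$ nft add chain inet demo fwd '{ type filter hook forward priority 0; }'
$ nft add set inet demo netavark_subnets '{ type ipv4_addr; flags interval; }'
$ nft add element inet demo netavark_subnets '{ 10.89.1.0/24, 10.89.2.0/24 }'
$ nft add rule inet demo fwd ip saddr @netavark_subnets accept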

I'm not sure about 1, but I'd like to think this is fairly well optimized in the kernel. For 2, while writing this up I checked firewalld's issue tracker and found firewalld/firewalld#1399, so this is somewhat of a known issue there. For 3, if there's no specific reason for using subnets instead of interfaces in the firewall rules, netavark could filter by interface instead, which would prevent the rule explosion when a network has both IPv4 and IPv6 subnets.

Can suggestion 3 be looked into?

Luap99 commented

Are you using netavark with the firewalld driver? This driver is not recommended as it is super buggy, see the other issues in this repo.

Regardless, both our own iptables and nftables drivers also match on subnets, so I would assume that if subnets are the problem, they face the same issue.

For 3, if there's no specific reason for using subnets instead of interfaces in the firewall rules, netavark could filter by interface instead, which would prevent the rule explosion when a network has both IPv4 and IPv6 subnets.

I did suggest using interfaces a while ago to @mheon, but we decided that using subnets was better (as it's what we have been doing historically) and there weren't really strong reasons for interfaces. If there are strong performance numbers we can certainly reconsider.
Problem: changing the rule layout is most likely a breaking change, as we would still need to remove/add rules where the old layout exists without causing conflicts.

Are you using netavark with the firewalld driver? This driver is not recommended as it is super buggy, see the other issues in this repo.

Ah, I see, I guess I got lucky in that it works well enough for my use case (and for the issue where DNS isn't allowed by default, I can easily add a permanent rule to fix that). I'm not sure I can safely use the iptables or nftables drivers, since there are other rules I want in place, and I want to make sure there's no conflict/overlap between netavark's rules and mine (including an easy way to modify the firewall rules at runtime).
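
The permanent rule for DNS is basically something along these lines (quoting from memory, so treat it as a sketch rather than the exact command I used):

$ firewall-cmd --permanent --zone=netavark_zone --add-service=dns
$ firewall-cmd --reload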

Regardless, both our own iptables and nftables drivers also match on subnets, so I would assume that if subnets are the problem, they face the same issue.

I wonder if it's more of a firewalld issue, primarily because of the nature of the rules that are generated: there are a ton of jump targets that don't do anything, and sets are not being used at all. It's entirely possible the iptables and nftables drivers could hit this issue too, but maybe only at a much higher scale.

Problem: changing the rule layout is most likely a breaking change, as we would still need to remove/add rules where the old layout exists without causing conflicts.

Not entirely sure what you mean by this; what would the breaking change be?

Not entirely sure what you mean by this; what would the breaking change be?

netavark speaks rule layout 1 today; now you update netavark to rule layout 2. It no longer understands rule layout 1 unless we add expensive rule-migration workarounds, and as such it likely errors out if there is anything incompatible, or doesn't remove the old subnet-based rules, leaving the system in a bad state. Of course a reboot would fix these things easily, but not everybody does that.

Ah, understood, migration would be the tricky part here.