google/nftables

Not all response messages are received, causing the receive buffer to overflow

maxmoehl opened this issue · 8 comments

This is closely related to #103.

I was observing a similar issue: after writing about 240 rules I'd receive conn.Receive: netlink receive: recvmsg: no buffer space available. The rules I'm writing are always created in pairs: two AddRule calls are always followed by a Flush. Using NLDEBUG=level=1 to produce debug output from the netlink library, I was able to observe that my application executed 490 calls to SendMessages but only around 320 messages were received (this doesn't match the 240 rules I see, but I'll ignore that for now).
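For context, the write path of my application boils down to something like this (a simplified sketch: the table, chain and expressions are placeholders rather than my actual rules, and it assumes v0.1.0 of this library, where New returns an error and AsLasting enables the lasting connection):

package main

import (
	"log"

	"github.com/google/nftables"
	"github.com/google/nftables/expr"
)

func main() {
	// Lasting connection, since we know we will be sending 1000+ messages.
	conn, err := nftables.New(nftables.AsLasting())
	if err != nil {
		log.Fatal(err)
	}

	table := conn.AddTable(&nftables.Table{Family: nftables.TableFamilyIPv4, Name: "test"})
	chain := conn.AddChain(&nftables.Chain{Name: "test", Table: table})
	if err := conn.Flush(); err != nil {
		log.Fatal(err)
	}

	for i := 0; ; i++ {
		// Two AddRule calls are always followed by a single Flush.
		conn.AddRule(&nftables.Rule{Table: table, Chain: chain, Exprs: []expr.Any{&expr.Counter{}}})
		conn.AddRule(&nftables.Rule{Table: table, Chain: chain, Exprs: []expr.Any{&expr.Counter{}}})
		if err := conn.Flush(); err != nil {
			log.Fatalf("failed at flush %d: %v", i, err)
		}
	}
}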

Looking at the receive logic it seems like it doesn't really care how many messages are read back:

nftables/conn.go

Lines 234 to 241 in d27cc52

for _, msg := range cc.messages {
	if msg.Header.Flags&netlink.Acknowledge == 0 {
		continue // message did not request an acknowledgement
	}
	if _, err := conn.Receive(); err != nil {
		return fmt.Errorf("conn.Receive: %w", err)
	}
}

A single receive call may or may not return multiple messages. What should be done at this stage is to ensure we read back as many messages as we expect.

The number of messages returned by each call to Receive seems to depend on a number of variables, but it looks like netlink is unable to return more than two (and often not even two) at a time. This causes messages to accumulate in the receive buffer, since we call Receive only once per message sent.
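One way to make that expectation explicit would be to keep receiving until the acknowledgement (a netlink error message with a matching sequence number) for each sent message has actually arrived, instead of issuing exactly one Receive per message. A rough sketch of the idea; receiveUntilAck is a hypothetical helper, not part of this library or mdlayher/netlink:

import "github.com/mdlayher/netlink"

// receiveUntilAck drains the socket until the acknowledgement (or error)
// for the request with the given sequence number has been seen.
func receiveUntilAck(conn *netlink.Conn, seq uint32) ([]netlink.Message, error) {
	var msgs []netlink.Message
	for {
		replies, err := conn.Receive()
		if err != nil {
			return nil, err
		}
		msgs = append(msgs, replies...)
		for _, m := range replies {
			// Acks are delivered as netlink error messages carrying the
			// sequence number of the request that asked for them.
			if m.Header.Type == netlink.Error && m.Header.Sequence == seq {
				return msgs, nil
			}
		}
	}
}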

To confirm my theory I've implemented a fix in #236. With the fix applied I was able to write 10,000 rules continuously without any changes to my logic.

turekt commented

Hi @maxmoehl,

I believe this is closely related to what I have stumbled upon once and tried to address in #191 with receive retries. See the following comment for more details: #191 (comment)

I ended up opening a PR in netlink as agreed in the original nftables PR #191, but it is not yet reviewed: mdlayher/netlink#200 (comment)

Are you using the nftables lasting connection? Have you tried changing this to a standard connection to see how your program behaves?

Thanks for the references! I will take a closer look at those to better understand the issue.

Yes, we are using a lasting connection; it seemed like a reasonable choice given that we know we will be sending 1000+ messages.

I too saw that there was another issue (see the second comment in my PR); however, the change still improved the behaviour in some situations (though I have to admit I never tested it properly beyond a few test runs).

turekt commented

I have looked further into the issues I previously sent you and found that the specific case I was investigating was resolved in #194, but please still check all the links to confirm whether they are related to your case.

Since you have confirmed that you are using a lasting connection, I can only say at this point that the lasting connection behaves differently from the standard one, and it is possible that you have found a case which is not covered by receiveAckAware. To be completely sure, I would need to take some time to verify this.

Can you maybe share a code snippet that reproduces the issue? In the meantime, you can switch to using a standard connection, as I would expect it not to be affected by this issue. Your best bet for fixing this would be to check how the original nftables tooling handles these messages.

I just tried running it with a non-lasting connection and it seems to be working fine (which makes sense, given that each connection will have its own buffer which is discarded after one operation).

However, as expected, the overhead is quite high: all benchmarks I currently have take about twice as long compared to the lasting connection. This could be mitigated by queueing more than one operation at a time, thereby increasing the number of rules written per flush (roughly as sketched below). But at some point I will then run into the send buffer overflowing, because there is no protection against that either.
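For reference, the only things my test program varies are how the connection is created and how many rules are queued before each Flush, roughly like this (a sketch; writeBatch is just an illustrative name and the rule expressions are placeholders):

// Standard connection: a fresh netlink socket per operation, so its receive
// buffer is discarded after every flush.
//   conn, err := nftables.New()
// Lasting connection: one socket for the lifetime of the program.
//   conn, err := nftables.New(nftables.AsLasting())

// writeBatch queues rulesPerFlush rules and writes them in a single netlink
// batch; conn can be either a standard or a lasting connection.
func writeBatch(conn *nftables.Conn, table *nftables.Table, chain *nftables.Chain, rulesPerFlush int) error {
	for i := 0; i < rulesPerFlush; i++ {
		conn.AddRule(&nftables.Rule{Table: table, Chain: chain, Exprs: []expr.Any{&expr.Counter{}}})
	}
	return conn.Flush()
}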

I've put a working example here. These are the results I got:

RUN 1.X uses v0.1.0 of the nftables library; for RUN 2.X I added this line to the go.mod file to use my "fixed" version:

replace github.com/google/nftables => github.com/maxmoehl/nftables v0.0.0-20230810155609-6773488cc03f
$ # RUN 1.1
$ go build -o nftables-test .; sudo ip netns add nftables-test; sudo LASTING_CONN=1 RULES_PER_FLUSH=1 ip netns exec nftables-test ./nftables-test; sudo ip netns delete nftables-test
0
100
failed at flush 113
panic: conn.Receive: netlink receive: recvmsg: no buffer space available

goroutine 1 [running]:
main.p(...)
	/home/<masked>/main.go:96
main.main()
	/home/<masked>/main.go:89 +0x67b
$ # RUN 1.2
$ go build -o nftables-test .; sudo ip netns add nftables-test; sudo LASTING_CONN=1 RULES_PER_FLUSH=2 ip netns exec nftables-test ./nftables-test; sudo ip netns delete nftables-test
0
failed at flush 92
panic: conn.Receive: netlink receive: recvmsg: no buffer space available

goroutine 1 [running]:
main.p(...)
	/home/<masked>/main.go:96
main.main()
	/home/<masked>/main.go:89 +0x67b
$ # RUN 2.1
$ go build -o nftables-test .; sudo ip netns add nftables-test; sudo LASTING_CONN=1 RULES_PER_FLUSH=1 ip netns exec nftables-test ./nftables-test; sudo ip netns delete nftables-test
0
100
200
failed at flush 225
panic: conn.Receive: netlink receive: recvmsg: no buffer space available

goroutine 1 [running]:
main.p(...)
	/home/<masked>/main.go:96
main.main()
	/home/<masked>/main.go:89 +0x67b
$ # RUN 2.2
$ go build -o nftables-test .; sudo ip netns add nftables-test; sudo LASTING_CONN=1 RULES_PER_FLUSH=2 ip netns exec nftables-test ./nftables-test; sudo ip netns delete nftables-test
0
100
200
300
400
500
600
700
800
900
<...>
6700
^C

For completeness: the tests have been run on an AWS VM with the following specs:

$ uname -a
Linux workstation 5.19.0-1029-aws #30~22.04.1-Ubuntu SMP Thu Jul 13 17:17:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  2
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            4999.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

Regarding the change you made: it only applies to get operations, while so far I'm only seeing issues when writing things. But in the end it's a similar issue: the responses we should expect are not properly read. The netlink protocol has some features to ensure reliable transmission, and if the responses are simply discarded all of these guarantees are lost. It could even be that the kernel is requesting an ACK from the client; if we don't respond to that, it might keep sending the same message indefinitely, causing the receive buffer to overflow.

This message is actually used to denote both an ACK and a NACK.
Typically, the direction is from FEC to CPC (in response to an ACK
request message). However, the CPC should be able to send ACKs back
to FEC when requested.

RFC 3549 Section 2.3.2.2

turekt commented

Hi @maxmoehl,

I have tested your snippet and I believe I see where the issue is; it is related to receiveAckAware and #194.

From the output it can be seen that, on your lasting connection, multiple messages are sent as a response, for example:

nl: send msgs: {Header:{Length:20 Type:unknown(16) Flags:request Sequence:164669939 PID:12568} Data:[0 0 0 10]}
nl: send msgs: {Header:{Length:44 Type:unknown(2560) Flags:request|acknowledge|0x400 Sequence:164669940 PID:12568} Data:[2 ...
nl: send msgs: {Header:{Length:48 Type:unknown(2563) Flags:request|acknowledge|0x400 Sequence:164669941 PID:12568} Data:[2 ...
nl: send msgs: {Header:{Length:20 Type:unknown(17) Flags:request Sequence:164669942 PID:12568} Data:[0 0 0 10]}
nl: recv: {Header:{Length:36 Type:error Flags:0x100 Sequence:164669940 PID:12568} Data:[0 0 0 0 44 0 0 0 0 10 5 4 244 169 2...
nl: recv: {Header:{Length:36 Type:error Flags:0x100 Sequence:164669941 PID:12568} Data:[0 0 0 0 48 0 0 0 3 10 5 4 245 169 2...
nl: send msgs: {Header:{Length:20 Type:unknown(16) Flags:request Sequence:164669943 PID:12568} Data:[0 0 0 10]}
nl: send msgs: {Header:{Length:376 Type:unknown(2566) Flags:request|acknowledge|echo|0xc00 Sequence:164669944 PID:12568} Da...
nl: send msgs: {Header:{Length:20 Type:unknown(17) Flags:request Sequence:164669945 PID:12568} Data:[0 0 0 10]}
nl: recv: {Header:{Length:388 Type:unknown(2566) Flags:0xc00 Sequence:164669944 PID:12568} Data:[2 0 55 240 16 0 1 0 98 117...
nl: recv: {Header:{Length:48 Type:unknown(2575) Flags:0 Sequence:164669944 PID:12568} Data:[0 0 55 240 8 0 1 0 0 0 55 240 8...
nl: send msgs: {Header:{Length:20 Type:unknown(16) Flags:request Sequence:164669946 PID:12568} Data:[0 0 0 10]}
nl: send msgs: {Header:{Length:376 Type:unknown(2566) Flags:request|acknowledge|echo|0xc00 Sequence:164669947 PID:12568} Da...
nl: send msgs: {Header:{Length:20 Type:unknown(17) Flags:request Sequence:164669948 PID:12568} Data:[0 0 0 10]}
nl: recv: {Header:{Length:36 Type:error Flags:0x100 Sequence:164669944 PID:12568} Data:[0 0 0 0 120 1 0 0 6 10 13 12 248 16...

Note the sequences we are sending and receiving. Observe how the last nl: recv is acknowledging sequence 164669944 while it should be acknowledging 164669947 (determined by the last set of nl: send messages).

In my observations, netlink responds with the following set of messages after we initiate a request for a new rule:

  1. msg type 2566 (type set: netlink.Overrun | netlink.Error) - the data did not fit into a single message
  2. msg type 2575 (type set: netlink.Overrun | netlink.Done | netlink.Noop) - rest of the data, no more data after this
  3. msg type error (ack) - acknowledging your request

All of those messages are sent for each message requesting echo|acknowledge. Since we receive only the first one (2566) with conn.Receive, we only get the second one (2575) after we send the next echo|acknowledge message for creating a new rule and call conn.Receive, and we only get the ack for the first message after we send the third echo|acknowledge message. This lag eventually accumulates and results in a buffer overflow.
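As a side note, the raw header types above follow the usual nfnetlink encoding (subsystem in the upper byte, message type in the lower byte), so they can also be read against the nf_tables constants; a small illustrative helper, not part of the library:

// decodeNfnlType splits a raw nfnetlink header type into its subsystem and
// message type. With NFNL_SUBSYS_NFTABLES = 10:
//   2560 = 0x0a00 -> NFT_MSG_NEWTABLE
//   2563 = 0x0a03 -> NFT_MSG_NEWCHAIN
//   2566 = 0x0a06 -> NFT_MSG_NEWRULE (the echoed rule)
//   2575 = 0x0a0f -> NFT_MSG_NEWGEN (generation counter notification)
func decodeNfnlType(t uint16) (subsystem, msgType uint8) {
	return uint8(t >> 8), uint8(t & 0xff)
}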

I have patched receiveAckAware and replaced the conn.Receive call that you were trying to fix. Try the following patch and let me know if it works for you:

diff --git a/conn.go b/conn.go
index 711d7f6..5c8f51d 100644
--- a/conn.go
+++ b/conn.go
@@ -154,14 +154,33 @@ func receiveAckAware(nlconn *netlink.Conn, sentMsgFlags netlink.HeaderFlags) ([]
 		return reply, nil
 	}
 
-	// Dump flag is not set, we expect an ack
+	if len(reply) != 0 {
+		last := reply[len(reply)-1]
+		for re := last.Header.Type; (re&netlink.Overrun) == netlink.Overrun && (re&netlink.Done) != netlink.Done; re = last.Header.Type {
+			// we are not finished, the message is overrun
+			r, err := nlconn.Receive()
+			if err != nil {
+				return nil, err
+			}
+			reply = append(reply, r...)
+			last = reply[len(reply)-1]
+		}
+
+		if last.Header.Type == netlink.Error && binaryutil.BigEndian.Uint32(last.Data[:4]) == 0 {
+			// we have already collected an ack
+			return reply, nil
+		}
+	}
+
+	// Now we expect an ack
 	ack, err := nlconn.Receive()
 	if err != nil {
 		return nil, err
 	}
 
 	if len(ack) == 0 {
-		return nil, errors.New("received an empty ack")
+		// received an empty ack?
+		return reply, nil
 	}
 
 	msg := ack[0]
@@ -232,10 +251,7 @@ func (cc *Conn) Flush() error {
 
 	// Fetch the requested acknowledgement for each message we sent.
 	for _, msg := range cc.messages {
-		if msg.Header.Flags&netlink.Acknowledge == 0 {
-			continue // message did not request an acknowledgement
-		}
-		if _, err := conn.Receive(); err != nil {
+		if _, err := receiveAckAware(conn, msg.Header.Flags); err != nil {
 			return fmt.Errorf("conn.Receive: %w", err)
 		}
 	}

I am not entirely sure whether this patch is complete. I can confirm that I cannot reproduce the issue anymore and that all tests are passing.

Yes, it works! Thanks!

Can you provide a PR for this change?