oxidecomputer/opte

OPTE incorrectly changing payload length on padded short packets

On attempting to access Nexus on the dogfood rack, some folks on Windows noticed they weren't able to and were instead getting HTTP 400 responses or SSL errors. I also managed to reproduce it with a local omicron + SoftNPU deployment. After digging in with @rcgoodfellow, we found some weird behaviour in the initial SYN/SYN-ACK/ACK sequence that was tripping things up.

Turns out on Windows, TCP timestamps are disabled by default, which means it may send the final ACK in the handshake as a zero-length packet with no TCP options. The IPv4 total length field would then be 20 (IPv4 header) + 20 (TCP header) = 40 bytes. Yet snoop inside the Nexus zone on its OPTE-layered vNIC showed a length of 46 bytes, and indeed there were 6 extra bytes at the end! Turns out 6 bytes of zeros does not a valid HTTP(S) request make, hence the 400/SSL errors we were seeing.

Ok, so 2 questions now:

  1. Why are there an extra 6 zeros?
  2. Why is the length 46 bytes instead of 40 bytes?

The answer to (2) is that OPTE changed it. During packet processing we apply whatever transformations are needed (removing the outer Geneve encapsulation, rewriting addresses/ports for NAT, etc.) and write out a new set of headers:

opte/opte/src/engine/packet.rs

Lines 1458 to 1471 in d367866

pub fn emit_new_headers(&mut self) -> Result<(), WriteError> {
    // At this point the packet metadata represents the
    // transformations made by the pipeline. We take the following
    // steps to emit the new headers and update the packet data.
    //
    // 1. Figure out length required to emit the new headers.
    //
    // 2. Determine if this length can be met by the current first
    //    segment. If not, allocate a new segment to prepend to
    //    the xlist.
    //
    // 3. Emit the new header bytes based on the current metadata.
    //
    // 4. Update the headers offsets, body info, and checksums.

But in writing out the IPv4 total length, we calculate it based on the total payload we've received and hence include those 6 extra bytes:

ip4.total_len = (new_pkt_len - pkt_offset) as u16;

Now in this case, as far as OPTE is concerned, it should respect the original stated length here and that is something we should fix. But coming back to (1): why are there an extra 6 zeros? And where are they getting added? We took some packet captures along the way to try to answer that.
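As an aside on (2), here is a minimal sketch contrasting the two ways of deriving the IPv4 total length. The function names and parameters are hypothetical and are not OPTE's API or its actual fix. The example numbers assume a 60-byte buffer with the IPv4 header at offset 14 and unchanged 20-byte IPv4 and TCP headers.

```rust
/// Buffer-derived length, shaped like the expression quoted above: every
/// byte from the start of the IPv4 header to the end of the buffer counts,
/// including any trailing bytes that are not part of the datagram.
fn total_len_from_buffer(pkt_len: usize, ip_offset: usize) -> u16 {
    (pkt_len - ip_offset) as u16
}

/// Header-derived length (hypothetical alternative): keep the payload size
/// the sender declared and only account for a change in header sizes.
/// Returns `None` if the stated length is smaller than the original headers
/// or the result does not fit in 16 bits.
fn total_len_from_stated(
    orig_total_len: u16,
    orig_hdrs_len: u16,
    new_hdrs_len: u16,
) -> Option<u16> {
    let payload_len = orig_total_len.checked_sub(orig_hdrs_len)?;
    payload_len.checked_add(new_hdrs_len)
}
```

With those inputs, `total_len_from_buffer(60, 14)` yields 46, while `total_len_from_stated(40, 40, 40)` yields `Some(40)`.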

Nexus:      192.168.1.21  (External IP) / 172.30.1.5 (OPTE Private IP)
Windows:    192.168.2.110

First, as seen from Windows' perspective:

19:53:13.113798 IP (tos 0x0, ttl 128, id 34606, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.2.110.50050 > 192.168.1.21.80: Flags [.], cksum 0x84ee (incorrect -> 0xb6b6), ack 1825894618, win 6147, length 0
        0x0000:  4500 0028 872e 4000 8006 eecd c0a8 026e  E..(..@........n
        0x0010:  c0a8 0115 c382 0050 3404 06c1 6cd4 f0da  .......P4...l...
        0x0020:  5010 1803 84ee 0000                      P.......

As mentioned before, we're sending out the final ACK in the handshake as a zero-length packet with no TCP options. The IPv4 total length is 40 bytes and there are no extra zeros, as expected. Since Windows is running in a VM here, we can also take a look at the packet leaving the Linux host on its physical interface:

19:53:13.113850 IP (tos 0x0, ttl 127, id 34606, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.2.110.50050 > 192.168.1.21.80: Flags [.], cksum 0x84ee (incorrect -> 0xb6b6), ack 1825894618, win 6147, length 0
        0x0000:  4500 0028 872e 4000 7f06 efcd c0a8 026e  E..(..@........n
        0x0010:  c0a8 0115 c382 0050 3404 06c1 6cd4 f0da  .......P4...l...
        0x0020:  5010 1803 84ee 0000                      P.......

The TTL has gone down (128 -> 127) and the IP checksum is different, but everything else is the same, as expected. So far so good. Next, we can take a look at the packet coming in on the physical interface (e1000g0) in Helios (before SoftNPU):

________________________________
192.168.2.110 -> 192.168.1.21 ETHER Type=0800 (IP), size=60 bytes
192.168.2.110 -> 192.168.1.21 IP  D=192.168.1.21 S=192.168.2.110 LEN=40, ID=34606, TOS=0x0, TTL=127
192.168.2.110 -> 192.168.1.21 TCP D=80 S=50050 Ack=1825894618 Seq=872679105 Len=0 Win=6147

           0: a8e1 de01 701d e0d5 5e2b 9c25 0800 4500    ....p...^+.%..E.
          16: 0028 872e 4000 7f06 efcd c0a8 026e c0a8    .(..@........n..
          32: 0115 c382 0050 3404 06c1 6cd4 f0da 5010    .....P4...l...P.
          48: 1803 b6b6 0000 0000 0000 0000              ............

Well well, would you look at that: the mysterious zeros have shown up, picked up somewhere along the way from Linux through the local router/switch and onto Helios.
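The trailing bytes can also be counted mechanically by comparing the captured frame size against the stated IPv4 total length. This is a minimal sketch that assumes an untagged Ethernet II frame carrying IPv4; `trailing_bytes` is a hypothetical helper, and `FRAME` is copied from the snoop output above.

```rust
/// The 60-byte frame captured on e1000g0.
const FRAME: [u8; 60] = [
    0xa8, 0xe1, 0xde, 0x01, 0x70, 0x1d, 0xe0, 0xd5,
    0x5e, 0x2b, 0x9c, 0x25, 0x08, 0x00, 0x45, 0x00,
    0x00, 0x28, 0x87, 0x2e, 0x40, 0x00, 0x7f, 0x06,
    0xef, 0xcd, 0xc0, 0xa8, 0x02, 0x6e, 0xc0, 0xa8,
    0x01, 0x15, 0xc3, 0x82, 0x00, 0x50, 0x34, 0x04,
    0x06, 0xc1, 0x6c, 0xd4, 0xf0, 0xda, 0x50, 0x10,
    0x18, 0x03, 0xb6, 0xb6, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
];

const ETHER_HDR_LEN: usize = 14;
const ETHERTYPE_IPV4: u16 = 0x0800;

/// Number of bytes in `frame` that lie beyond the end of the IPv4 datagram,
/// or `None` if the frame is too short, is not IPv4, or is shorter than the
/// IPv4 header claims.
fn trailing_bytes(frame: &[u8]) -> Option<usize> {
    let ethertype = u16::from_be_bytes([*frame.get(12)?, *frame.get(13)?]);
    if ethertype != ETHERTYPE_IPV4 {
        return None;
    }
    // The total length field sits at bytes 2..4 of the IPv4 header.
    let total_len = u16::from_be_bytes([
        *frame.get(ETHER_HDR_LEN + 2)?,
        *frame.get(ETHER_HDR_LEN + 3)?,
    ]);
    frame.len().checked_sub(ETHER_HDR_LEN + usize::from(total_len))
}
```

For this capture, `trailing_bytes(&FRAME)` returns `Some(6)`: 60 bytes on the wire minus 14 bytes of Ethernet header and the stated 40 bytes of IPv4.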
At least the IPv4 length and derived TCP length are still correct at 40 and 0, respectively, which makes sense as we've yet to hit the OPTE codepath mentioned above. We can confirm that by also taking a look at the packet inside the Nexus zone:

________________________________
192.168.2.110 -> 172.30.1.5   ETHER Type=0800 (IP), size=60 bytes
192.168.2.110 -> 172.30.1.5   IP  D=172.30.1.5 S=192.168.2.110 LEN=46, ID=34606, TOS=0x0, TTL=127
192.168.2.110 -> 172.30.1.5   TCP D=80 S=50050 Ack=1825894618 Seq=872679105 Len=6 Win=6147
192.168.2.110 -> 172.30.1.5   HTTP (body)

           0: a840 25ff a4ed a840 25ff 7777 0800 4500    .@%....@%.ww..E.
          16: 002e 872e 4000 7f06 0462 c0a8 026e ac1e    ....@....b...n..
          32: 0105 c382 0050 3404 06c1 6cd4 f0da 5010    .....P4...l...P.
          48: 1803 cb4a 0000 0000 0000 0000              ...J........

OPTE has definitely touched this packet, as we can see the 1:1 NAT for the external IP was applied (192.168.1.21 -> 172.30.1.5). The IPv4 length also got updated to 46 bytes, which means that what was originally a zero-length TCP packet with some extra bytes at the end is now treated as a TCP packet with a 6-byte payload.
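To see why the receiver now believes there is a payload, recall that the TCP payload length is not carried in any header; it is derived from the IPv4 total length minus the two header lengths. A minimal sketch of that derivation follows. `tcp_payload_len` is a hypothetical helper; its inputs are the first byte of the IPv4 header (version/IHL), the total length field, and the TCP data offset byte.

```rust
/// Derive the TCP payload length the way a receiver would.
///
/// `ver_ihl` is the first byte of the IPv4 header (0x45 in the captures),
/// `total_len` is the IPv4 total length field, and `tcp_off_byte` is the
/// TCP header byte whose upper nibble holds the data offset (0x50 here).
/// Returns `None` if the stated total length cannot cover both headers.
fn tcp_payload_len(ver_ihl: u8, total_len: u16, tcp_off_byte: u8) -> Option<u16> {
    // Both header lengths are encoded as a count of 32-bit words.
    let ip_hdr_len = u16::from(ver_ihl & 0x0f) * 4;
    let tcp_hdr_len = u16::from(tcp_off_byte >> 4) * 4;
    total_len.checked_sub(ip_hdr_len)?.checked_sub(tcp_hdr_len)
}
```

With the original total length, `tcp_payload_len(0x45, 40, 0x50)` gives `Some(0)`; with the rewritten one, `tcp_payload_len(0x45, 46, 0x50)` gives `Some(6)`, which is exactly the `Len=6` snoop reports.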

So, we've confirmed the problematic OPTE behaviour in modifying the inner IP length. As for the extra zeros, Ethernet frames have a minimum size of 64 bytes (which includes a 4-byte Frame Check Sequence (FCS) after the payload). The Linux igb driver sets the Pad Short Packets flag in the NIC's Transmit Control Register (TCTL.PSP), for which the datasheet says:

Pad Short Packets
0b = Do not pad.
1b = Pad short packets.
Padding makes the packet 64 bytes long

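Doing the math: 64 - (14 [Ethernet header] + 20 [IPv4 header] + 20 [TCP header] + 4 [FCS]) = 6 bytes of padding, and so mystery solved. It also lines up with the 60-byte frames in the snoop output, which do not include the 4-byte FCS.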
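The same arithmetic as a small sketch, assuming an untagged Ethernet frame (no VLAN tag); the constant and function names are made up for illustration and do not come from the igb driver or the datasheet.

```rust
const ETHER_MIN_FRAME_LEN: usize = 64; // minimum frame size, FCS included
const ETHER_HDR_LEN: usize = 14; // destination MAC, source MAC, EtherType
const ETHER_FCS_LEN: usize = 4; // Frame Check Sequence

/// Padding bytes needed to bring a frame carrying `l3_len` bytes of
/// payload (here, the whole IPv4 datagram) up to the minimum frame size.
fn short_frame_padding(l3_len: usize) -> usize {
    let min_payload = ETHER_MIN_FRAME_LEN - ETHER_HDR_LEN - ETHER_FCS_LEN;
    min_payload.saturating_sub(l3_len)
}
```

For the 40-byte ACK, `short_frame_padding(40)` is 6. The minimum payload works out to 46 bytes, which is why the rewritten IPv4 total length came out as exactly 46.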