oxidecomputer/opte

Suspicious amount of time spent in `bad_packet_probe`

Closed this issue · 3 comments

I was capturing kernel flamegraphs for a Crucible Downstairs in the Madrid mini-cluster, and noticed that we're spending ~2% of our runtime in bad_packet_probe:

Screenshot 2024-02-14 at 4 45 34 PM

(and another 1% doing other log formatting in xde_rx)

This seems like a lot to me!

Attaching to the probe for 10 seconds, I see that it's mostly UnexpectedProtocol(TCP) errors.

BRM42220081 # dtrace -q -n 'bad-packet { @[(string)arg0, (string)arg3] = count() } tick-10s { exit(0) }'

  unknown    Parse(BadHeader("BadLength { len: 60 }"))                        80
  unknown    Parse(BadHeader("ReadError { error: NotEnoughBytes }"))         120
  unknown    Parse(UnexpectedProtocol(TCP))                                 4308

In other runs, I also saw

  unknown    Parse(BadHeader("BadLength { len: 72 }"))                         1
  unknown    Parse(UnexpectedEtherType(6C00))                                  1
  unknown    Parse(UnexpectedProtocol(ICMPv6))                                 1

This is on BRM42220081 in cubby 14, which is the scrimlet for the Madrid mini-cluster.

Thanks a lot for digging this up! Repeating some of what we discussed earlier, this seems to be inbound underlay traffic on cxgbe0 and cxgbe1. Ordinarily we expect to only see IPv6 + UDP + Geneve coming in, which is consistent with the parse errors there – OPTE is unhappy to see TCP, and I suspect it's trying to handle DDM solicits/advertisements as though they are actually Geneve (read: poorly).

Rather luckily, both fmt are right next to one another:

opte/xde/src/xde.rs

Lines 2064 to 2065 in d695968

bad_packet_parse_probe(None, Direction::In, mp_chain, &e);
opte::engine::dbg(format!("Rx bad packet: {:?}", e));

I'm keen to completely de-String this as far as SDTs are concerned. ParseError::BadHeader(String) is another bad apple but I need to actually think on that one, and it's not coming up for every non-OPTE underlay packet...

image

We seem to be guilty of the same crimes in Port::process_in, somewhere. To that box's left is xde alloc::ffi::c_str::CString::_from_vec_unchecked::h0f5b657f8d5c91b1...

Closed by #459.