Suspicious amount of time spent in `bad_packet_probe`
Closed this issue · 3 comments
I was capturing kernel flamegraphs for a Crucible Downstairs in the Madrid mini-cluster, and noticed that we're spending ~2% of our runtime in bad_packet_probe
:
![Screenshot 2024-02-14 at 4 45 34 PM](https://private-user-images.githubusercontent.com/745333/304885138-c0cbc103-d04f-486f-9f09-21f36d7e2e97.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAyMTkxMjgsIm5iZiI6MTcyMDIxODgyOCwicGF0aCI6Ii83NDUzMzMvMzA0ODg1MTM4LWMwY2JjMTAzLWQwNGYtNDg2Zi05ZjA5LTIxZjM2ZDdlMmU5Ny5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzA1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcwNVQyMjMzNDhaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT00NmYxNmM3MGEwNGRkMWFlZGE5ZTQ4MmEyNGVlM2FmYTkwNjFhOWYwZGM0ZTBkOTA1YjkwNTE4NTJhNTE2MGZlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.Zv0qQxg2ZoXkSn4ReTKLyJZYHqdDN4MFm7MpwJYZJuw)
(and another 1% doing other log formatting in xde_rx
)
This seems like a lot to me!
Attaching to the probe for 10 seconds, I see that it's mostly UnexpectedProtocol(TCP)
errors.
BRM42220081 # dtrace -q -n 'bad-packet { @[(string)arg0, (string)arg3] = count() } tick-10s { exit(0) }'
unknown Parse(BadHeader("BadLength { len: 60 }")) 80
unknown Parse(BadHeader("ReadError { error: NotEnoughBytes }")) 120
unknown Parse(UnexpectedProtocol(TCP)) 4308
In other runs, I also saw
unknown Parse(BadHeader("BadLength { len: 72 }")) 1
unknown Parse(UnexpectedEtherType(6C00)) 1
unknown Parse(UnexpectedProtocol(ICMPv6)) 1
This is on BRM42220081 in cubby 14, which is the scrimlet for the Madrid mini-cluster.
Thanks a lot for digging this up! Repeating some of what we discussed earlier, this seems to be inbound underlay traffic on cxgbe0
and cxgbe1
. Ordinarily we expect to only see IPv6 + UDP + Geneve coming in, which is consistent with the parse errors there – OPTE is unhappy to see TCP, and I suspect it's trying to handle DDM solicits/advertisements as though they are actually Geneve (read: poorly).
Rather luckily, both fmt
are right next to one another:
Lines 2064 to 2065 in d695968
I'm keen to completely de-String
this as far as SDTs are concerned. ParseError::BadHeader(String)
is another bad apple but I need to actually think on that one, and it's not coming up for every non-OPTE underlay packet...
Closed by #459.