Improve afring throughput by minimizing individual unsafe casts

Question

Closed this issue 10 months ago · 3 comments

Currently, the code favors small accessor function calls over efficiency when interacting with the current TPacketHeader in the buffer. For example, the zero-copy payload extraction performs multiple individual calls to fetch the data:

return s.curTPacketHeader.payloadNoCopyAtOffset(0, s.curTPacketHeader.snapLen()),
  s.curTPacketHeader.packetType(),
  s.curTPacketHeader.pktLen(),

While the compiler inlines these calls, each of them still performs its own unsafe cast to the respective variable / type, which adds overhead. Cursory tests show that it might be more efficient to perform a single unsafe cast that maps all relevant data onto a partial header struct (allocated on the stack). Conveniently, the required information is laid out contiguously in memory as a single "blob" of three uint32 values (plus a superfluous one that we can skip). In a repeated and thorough benchmark, this tiny change reduces the time per operation by more than 20%:

                                     │    sec/op    │   sec/op     vs base                │
CaptureMethods/NextPayloadZeroCopy-4   24.79n ± 10%   19.20n ± 0%  -22.55% (p=0.000 n=25)
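The single-cast idea described above can be sketched roughly as follows. This is a minimal illustration, not the slimcap implementation: the byte offsets, the `partialHeader` type, and the helpers `newHeader` and `partial` are all hypothetical names and values chosen for the example (the layout is only loosely modeled on a TPACKET_V3-style header).

```go
package main

import (
	"fmt"
	"unsafe"
)

// partialOffset is the ASSUMED byte offset of the first field of interest.
// It is illustrative only and not taken from the slimcap source.
const partialOffset = 12

// partialHeader (hypothetical) mirrors the contiguous run of header fields
// that are needed together, including one field that is simply skipped.
type partialHeader struct {
	snapLen uint32
	pktLen  uint32
	_       uint32 // superfluous field, skipped
	mac     uint16
	net     uint16
}

// header is a raw packet header as it would sit in the ring buffer.
type header []byte

// Per-field accessors: every call performs its own unsafe cast.
func (h header) snapLen() uint32 { return *(*uint32)(unsafe.Pointer(&h[12])) }
func (h header) pktLen() uint32  { return *(*uint32)(unsafe.Pointer(&h[16])) }
func (h header) mac() uint16     { return *(*uint16)(unsafe.Pointer(&h[24])) }

// partial performs a single unsafe cast and copies the fields by value,
// so the result can live on the caller's stack.
func (h header) partial() partialHeader {
	_ = h[partialOffset+int(unsafe.Sizeof(partialHeader{}))-1] // bounds check
	return *(*partialHeader)(unsafe.Pointer(&h[partialOffset]))
}

// newHeader (hypothetical helper) builds an 8-byte-aligned 32-byte header
// and fills the assumed fields in native byte order.
func newHeader(snapLen, pktLen uint32, mac uint16) header {
	backing := make([]uint64, 4)
	h := header(unsafe.Slice((*byte)(unsafe.Pointer(&backing[0])), 32))
	*(*uint32)(unsafe.Pointer(&h[12])) = snapLen
	*(*uint32)(unsafe.Pointer(&h[16])) = pktLen
	*(*uint16)(unsafe.Pointer(&h[24])) = mac
	return h
}

func main() {
	h := newHeader(128, 1500, 66)
	p := h.partial() // one cast instead of three
	fmt.Println(p.snapLen == h.snapLen(), p.pktLen == h.pktLen(), p.mac == h.mac())
}
```

Both variants read the same memory; the difference is only how many casts are executed per packet. Whether this pays off in practice depends on the real layout and should be confirmed with a benchmark such as the one above.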

In addition, there are two caveats that bother me:

We are currently tracking / counting the number of packets in a block independently as part of the tPacketHeader struct (decrementing it by one each time a packet is fetched). However, looking at the implementation here, the final packet of a block is guaranteed to have its next offset set to zero. Since the next offset is never zero otherwise, it can serve as the termination criterion for the loop, which would make the separate counter redundant.
The main loop performing the PPOLL logic in nextPacket() is very intricate (because it has to handle the scenario in which it was unblocked and has to continue wherever it was prior to that event). Maybe there's a way to simplify the logic and reduce both overhead and code complexity.
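The first caveat above can be illustrated with a small sketch that walks a block purely by following the next-offset field until it reads zero, without a separate packet counter. The assumptions here are mine: the next offset is stored in the first four bytes of each packet header, it is relative to the start of the current header, and the block contains at least one packet. The helpers `buildBlock` and `walkBlock` are hypothetical, and the sketch uses encoding/binary with little-endian order instead of unsafe casts to stay portable.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildBlock (hypothetical helper) lays out packets of the given sizes back
// to back. Each packet starts with a 4-byte next offset; the last one is 0.
// Every size must be at least 4 bytes.
func buildBlock(sizes []uint32) []byte {
	var total uint32
	for _, s := range sizes {
		total += s
	}
	block := make([]byte, total)
	var pos uint32
	for i, s := range sizes {
		next := s
		if i == len(sizes)-1 {
			next = 0 // final packet of the block
		}
		binary.LittleEndian.PutUint32(block[pos:pos+4], next)
		pos += s
	}
	return block
}

// walkBlock returns the start position of every packet in the block. It
// stops when a header's next offset is zero, so no packet counter is needed.
func walkBlock(block []byte, first uint32) []uint32 {
	var starts []uint32
	pos := first
	for {
		starts = append(starts, pos)
		next := binary.LittleEndian.Uint32(block[pos : pos+4])
		if next == 0 {
			return starts
		}
		pos += next
	}
}

func main() {
	block := buildBlock([]uint32{64, 128, 96})
	fmt.Println(walkBlock(block, 0)) // prints "[0 64 192]"
}
```

A real implementation would also need to guard against corrupted offsets that point outside the block, which this sketch omits.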

DoD

Rewrite crucial paths for optimized data access (while still maintaining maximum readability)
Minimize duplicate tracking of number of packets per block
Reduce complexity of PPOLL logic
Further optimize assembler code

Answer 1 · 2023-09-11T11:32:39.000Z

@els0r Nice. After the first changes I'm seeing a significant improvement on all methods (even around 25% on the zero-copy methods):

                                      │ slimcap_baseline.txt │         slimcap_struct.txt          │
                                      │        sec/op        │   sec/op     vs base                │
CaptureMethods/NextPacket-4                      88.95n ± 1%   81.94n ± 0%   -7.88% (n=50)
CaptureMethods/NextPacketInPlace-4               39.13n ± 1%   35.72n ± 2%   -8.73% (p=0.000 n=50)
CaptureMethods/NextPayload-4                     80.95n ± 1%   71.03n ± 0%  -12.25% (n=50)
CaptureMethods/NextPayloadInPlace-4              27.54n ± 0%   27.88n ± 4%   +1.22% (p=0.000 n=50)
CaptureMethods/NextPayloadZeroCopy-4             25.32n ± 1%   19.42n ± 0%  -23.32% (n=50)
CaptureMethods/NextIPPacket-4                    81.11n ± 1%   73.26n ± 0%   -9.67% (n=50)
CaptureMethods/NextIPPacketInPlace-4             35.91n ± 1%   28.25n ± 3%  -21.33% (n=50)
CaptureMethods/NextIPPacketZeroCopy-4            25.64n ± 1%   18.58n ± 0%  -27.55% (n=50)
CaptureMethods/NextPacketFn-4                    26.98n ± 0%   20.26n ± 1%  -24.93% (n=50)

Note: NextPayloadInPlace did not improve on paper, but that's because I made a mistake in the initial implementation: it was actually performing a (faster) zero-copy operation although that was not requested (which could have been dangerous, of course). So in fact the in-place copy is now about as fast as the zero-copy operation was before 🤣

Answer 2 · 2023-09-12T14:42:40.000Z

After the refactoring, the nextPacket() logic now has a much cleaner call stack:

As can be seen, all layers are inlined by the compiler, with the exception of the heavy lifting in nextPacketZeroCopy() (which cannot be inlined due to its complexity). This way, the number of function calls is kept to the absolute minimum (the caller basically "runs" nextPacketZeroCopy() directly).

Answer 3 · 2023-09-25T13:33:45.000Z

As discussed. This is so (!) cool. I'll provide feedback as soon as I can build this internally again.