virtio-win/kvm-guest-drivers-windows

netkvm: Enhancing Host Throughput by Combining Virtio Header and Data in a Single Memory Block for NetKVM

zjmletang opened this issue · 10 comments

Is your feature request related to a problem? Please describe.
In netkvm's current reception logic, the virtio protocol header and the packet data reside in two separate memory blocks, so at least two memory blocks are needed for one descriptor. From the host's perspective, this means the network card (hardware implementation) requires two DMA operations to retrieve a single packet, consuming more PCIe bandwidth. When PCIe bandwidth is strained, this becomes a performance bottleneck because the CPU retrieves descriptors from the Windows virtio frontend driver more slowly.


Describe the solution you'd like
I'd like to discuss with you whether this can be done: combine the virtio header and the packet data into the same memory block, so that in most scenarios a descriptor needs to contain only one memory block.

I tested this improvement on our own cloud platform, and the tests show that the method is fairly effective at increasing hardware bandwidth utilization: overall network card throughput increased by about 10%.


Related: https://issues.redhat.com/browse/RHEL-20220

I have been exploring the details surrounding VIRTIO_NET_F_MRG_RXBUF and am a bit perplexed about how the merging of the virtio header and the packet, as described in this issue, relates to enabling VIRTIO_NET_F_MRG_RXBUF. It appears to me that merging the virtio header with the packet is feasible even when VIRTIO_NET_F_MRG_RXBUF is not used. If it's not too much trouble, I would greatly appreciate your insights on how these two aspects are interconnected.

@zjmletang
VIRTIO_NET_F_MRG_RXBUF is automatically enabled when the device indicates VIRTIO_F_VERSION_1; this is part of the virtio spec (network part). Merely indicating merged buffers has no effect. When RSC is enabled, the existing driver allocates buffers of ~64K for each RX packet (this may be up to 17 pages), so the device has no motivation even to try using merged buffers. Without RSC, the driver uses one or more 4K buffers depending on the Jumbo setting.

  1. Your table above indicates some improvement as a result of some modification. I suggest first of all that you share with us the modified code (just to understand the idea of the modification).
  2. For discussion it is important to understand:
  • which virtio features the device supports and which does not. The list of features of interest is here
  • how many queues the device supports/uses in typical performance scenarios
  • what are typical use cases you want to focus on? Large coalesced RX packets, mid-size RX packets, other?
  3. The issue Yan mentioned above does not focus on any specific modification. It is more about: "The Linux driver does use this feature. We (under Windows) do not use it at the moment. Is there something good we can gain if we do, and in which scenarios, if any?"
    So, I suggest starting by answering questions 1 and 2; then we can make progress with this discussion.

@ybendito
thanks

  1. Sure, I will submit the code as a patch in a few days, just for discussion purposes.

  • which virtio features the device supports and which does not. The list of features of interest is here

host features 0x000001035867ffe3, guest features 0x00000003186799a3

  • how many queues the device supports/uses in typical performance scenarios

16

  • what are typical use cases you want to focus on? Large coalesced RX packets, mid-size RX packets, other?

We are more concerned with how this change would enhance the overall PCIe bandwidth on the host side.

@ybendito
Based on your suggestion, I've uploaded the code. This code still requires optimization and is for discussion purposes only for now; please see #1089

@zjmletang

  1. As an idea this seems suitable: it should reduce the number of DMA transactions for a physical device and shouldn't have a negative impact on a paravirtualized one.
  2. Some time ago I worked on a more flexible memory layout scheme; there is no PR yet, but the preview is here.
  3. Applying your suggestion upstream at the moment seems problematic to me: the change is very sensitive. The existing driver supports not only V1 devices but also legacy devices that must have the header in a separate descriptor (see virtio spec 1.3, 5.1.6.6), so the driver must be tested well in several non-default device configurations. I'm not sure we have enough resources for such coverage at the moment.
  4. Not related to this, but to the features: it seems strange that the feature ctrl_guest_offload (2) is not in. It is not complicated to implement, but without it some flows might not work correctly.

@zjmletang
3. Applying your suggestion upstream at the moment seems problematic to me: the change is very sensitive. The existing driver supports not only V1 devices but also legacy devices that must have the header in a separate descriptor (see virtio spec 1.3, 5.1.6.6), so the driver must be tested well in several non-default device configurations. I'm not sure we have enough resources for such coverage at the moment.

@ybendito ,
Indeed, it is very sensitive. Could we distinguish the two cases based on whether the VIRTIO_F_ANY_LAYOUT feature bit has been negotiated? If it has been successfully negotiated, we could consider merging; if not, we could maintain the separation into two segments.

cc @sb-ntnx fyi - this would be neat for us as well.