In the vanilla l2fwd DPDK example, each thread (namely, a DPDK core) receives a burst (set) of packets, swaps the src/dst MAC addresses and transmits the same burst of modified packets back. l2fwd-nv is an improvement of l2fwd that shows how to use an mbuf pool with GPU data buffers through the vanilla DPDK API. The overall flow of the app is organised as follows:
- Create a number of pipelines, each one composed of:
  - one core to receive and accumulate bursts of packets (RX core) from a dedicated DPDK queue (RX queue)
  - a dedicated GPU/CPU workload entity that processes the bursts
  - one core to transmit the burst of modified packets (TX core) using a dedicated DPDK queue (TX queue)
- For each pipeline, in a loop:
  - The RX core accumulates packets in bursts
  - The RX core triggers (asynchronously) the work (MAC swapping) on the received burst using CUDA kernel(s)
  - The TX core waits for the completion of the work
  - The TX core sends the burst of modified packets
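The per-burst work (MAC swapping) can be illustrated with a minimal CPU-side sketch. This is not the CUDA kernel used by l2fwd-nv; `swap_mac` and `process_burst` are hypothetical names, and packets are modelled as plain Python `bytes` instead of DPDK mbufs.

```python
def swap_mac(frame: bytes) -> bytes:
    """Return a copy of an Ethernet frame with dst/src MAC addresses swapped."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header (14 bytes)")
    dst, src = frame[0:6], frame[6:12]
    # Bytes 0-5 hold the destination MAC, bytes 6-11 the source MAC.
    return src + dst + frame[12:]


def process_burst(burst):
    """Apply the MAC swap to every frame of a burst (hypothetical helper)."""
    return [swap_mac(frame) for frame in burst]


# Example frame: dst aa:bb:cc:dd:ee:ff, src 11:22:33:44:55:66, EtherType IPv4.
frame = bytes.fromhex("aabbccddeeff" "112233445566" "0800") + b"payload"
swapped = process_burst([frame])[0]
```

Swapping twice returns the original frame, which is a quick sanity check for any implementation of this workload.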
Please note that a single mempool is used for all the DPDK RX/TX queues. Using different command-line options, it's possible to:
- Create the mempool either in GPU memory or CPU pinned memory
- Decide how to do the MAC swapping in the packets:
  - No workload: MAC addresses are not swapped, l2fwd-nv is doing basic I/O forwarding
  - CPU workload: the CPU does the swap
  - GPU workload: a new CUDA kernel is triggered for each burst of accumulated packets
  - GPU persistent workload: a persistent CUDA kernel is triggered at the beginning on the CUDA stream dedicated to each pipeline. The CPU has to communicate to this kernel that a new burst of packets has to be processed
  - GPU workload with CUDA graphs: a number of CUDA kernels is triggered for the next 8 bursts of packets
- Enable buffer split feature: each received packet is split into two mbufs. The first 60B go into a CPU memory mbuf, the remaining bytes are stored in a GPU memory mbuf. The workload in this case is swapping some random bytes.
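To make the buffer split idea concrete, here is a minimal sketch of how one packet maps onto two segments. `split_packet` and `split_offset` are hypothetical names; the default of 60 bytes is taken from the description above (the usage text of the application mentions 64B, so treat the offset as a parameter). In the real application the split is performed on receive into separate mbufs, not by copying bytes in software.

```python
def split_packet(packet: bytes, split_offset: int = 60):
    """Split a packet into a head segment and a tail segment.

    The head stands in for the CPU memory mbuf, the tail for the GPU
    memory mbuf. Packets shorter than split_offset yield an empty tail.
    """
    head = packet[:split_offset]
    tail = packet[split_offset:]
    return head, tail


packet = bytes(range(100))
head, tail = split_packet(packet)
```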
Please note that not all the combinations give the best performance. This app should be considered a showcase that exposes all the possible combinations when dealing with GPUDirect RDMA and DPDK. l2fwd-nv has a trivial workload that doesn't really require the use of CUDA kernels.
03/11/2022
- Updated to DPDK 22.03
- GDRCopy direct calls removed in favour of new gpudev cpu_map functions
- Code cleanup
11/26/2021
- Updated to the latest DPDK 21.11 release
- Introduced the new gpudev library
- Benchmarks updated to latest MOFED 5.4, DPDK 21.11 and CUDA 11.4 with V100 and A100
- Benchmarks executed using testpmd as packet generator
Please note that DPDK 22.03 is included as a submodule of this project and is built locally with l2fwd-nv.
Ensure that your kernel parameters include the following list:
default_hugepagesz=1G hugepagesz=1G hugepages=16 tsc=reliable clocksource=tsc intel_idle.max_cstate=0 mce=ignore_ce processor.max_cstate=0 audit=0 idle=poll isolcpus=2-21 nohz_full=2-21 rcu_nocbs=2-21 rcu_nocb_poll nosoftlockup iommu=off intel_iommu=off
Note that 2-21 corresponds to the list of CPUs you intend to use for the DPDK application and the value of this parameter needs to be changed depending on the HW configuration.
To permanently include these items in the kernel parameters, open /etc/default/grub with your favourite text editor and add them to the variable named GRUB_CMDLINE_LINUX_DEFAULT. Save this file, install the new GRUB configuration and reboot the server:
$ sudo vim /etc/default/grub
$ sudo update-grub
$ sudo reboot
After reboot, verify that the changes have been applied. As an example, to verify the system has 1 GB hugepages:
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.4.0-53-lowlatency root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity default_hugepagesz=1G hugepagesz=1G hugepages=16 tsc=reliable clocksource=tsc intel_idle.max_cstate=0 mce=ignore_ce processor.max_cstate=0 idle=poll isolcpus=2-21 nohz_full=2-21 rcu_nocbs=2-21 nosoftlockup iommu=off intel_iommu=off
$ grep -i huge /proc/meminfo
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
HugePages_Total: 16
HugePages_Free: 15
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 16777216 kB
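The same verification can be scripted. The sketch below is only an illustration: `REQUIRED_PARAMS`, `missing_params` and `hugetlb_kb` are hypothetical names, and the required list is a subset chosen for the example. It checks a kernel command line for missing parameters and recomputes the total hugepage memory from the /proc/meminfo fields shown above.

```python
# Subset of the kernel parameters listed earlier (chosen for illustration).
REQUIRED_PARAMS = [
    "default_hugepagesz=1G",
    "hugepagesz=1G",
    "hugepages=16",
    "iommu=off",
    "intel_iommu=off",
]


def missing_params(cmdline: str, required):
    """Return the required parameters that do not appear in cmdline."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


def hugetlb_kb(meminfo: str) -> int:
    """Compute HugePages_Total * Hugepagesize (in kB) from meminfo text."""
    fields = {}
    for line in meminfo.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.split()
    return int(fields["HugePages_Total"][0]) * int(fields["Hugepagesize"][0])


# On a live system the inputs would come from open("/proc/cmdline").read()
# and open("/proc/meminfo").read(); fixed strings are used here instead.
meminfo_sample = "HugePages_Total:      16\nHugepagesize:    1048576 kB\n"
total_kb = hugetlb_kb(meminfo_sample)
```

With the values from the output above, 16 pages of 1048576 kB give 16777216 kB, which matches the Hugetlb line.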
You need to follow a few steps to configure your Mellanox network card.
- Download Mellanox OFED 5.4 from here
- Enable CQE compression
mlxconfig -d <NIC PCIe address> set CQE_COMPRESSION=1
If the Mellanox NIC supports IB and Ethernet mode (VPI adapters):
- Set the IB card as an Ethernet card
mlxconfig -d <NIC PCIe address> set LINK_TYPE_P1=2 LINK_TYPE_P2=2
- Reboot the server, or run mlxfwreset -d <NIC PCIe address> reset and /etc/init.d/openibd restart
Download and install the latest CUDA toolkit from here.
DPDK 22.03 requires Meson > 0.49.2.
sudo apt-get install python3-setuptools ninja-build
wget https://github.com/mesonbuild/meson/releases/download/0.56.0/meson-0.56.0.tar.gz
tar xvfz meson-0.56.0.tar.gz
cd meson-0.56.0
sudo python3 setup.py install
In order to enable GPUDirect RDMA with a Mellanox network card you need an additional kernel module.
If your system has CUDA 11.4 or newer, you need to load the nvidia_peermem module that comes with the NVIDIA CUDA Toolkit.
sudo modprobe nvidia-peermem
More info here.
If your system has an older CUDA version, you need to manually build and install the nv_peer_memory module.
git clone https://github.com/Mellanox/nv_peer_memory.git
cd nv_peer_memory
make
sudo insmod nv_peer_mem.ko
You can use cmake to build everything.
git clone --recurse-submodules https://github.com/NVIDIA/l2fwd-nv.git
cd l2fwd-nv
mkdir build
cd build
cmake ..
make -j$(nproc --all)
Starting from DPDK 22.03, GDRCopy has been embedded in DPDK and exposed through the rte_gpu_mem_cpu_map function.
The CMakeLists.txt file automatically builds the GDRCopy libgdrapi.so library.
After the build stage, you still need to load the gdrdrv kernel module on the system.
cd external/gdrcopy
sudo ./insmod.sh
Please note that, to enable GDRCopy in l2fwd-nv at runtime, you need to set the env var GDRCOPY_PATH_L with the path to the libgdrapi.so library, which resides in /path/to/l2fwd-nv/external/gdrcopy/src.
l2fwd-nv with -h shows the usage and all the possible options:
./build/l2fwdnv [EAL options] -- b|c|d|e|g|m|s|t|w|B|E|N|P|W
-b BURST SIZE: how many pkts x burst to RX
-d DATA ROOM SIZE: mbuf payload size
-g GPU DEVICE: GPU device ID
-m MEMP TYPE: allocate mbufs payloads in 0: host pinned memory, 1: GPU device memory
-n CUDA PROFILER: Enable CUDA profiler with NVTX for nvvp
-p PIPELINES: how many pipelines (each with 1 RX and 1 TX cores) to use
-s BUFFER SPLIT: enable buffer split, 64B CPU, remaining bytes GPU
-t PACKET TIME: force workload time (nanoseconds) per packet
-v PERFORMANCE PKTS: packets to be received before closing the application. If 0, l2fwd-nv keeps running until the CTRL+C
-w WORKLOAD TYPE: who is in charge to swap the MAC address, 0: No swap, 1: CPU, 2: GPU with one dedicated CUDA kernel for each burst of received packets, 3: GPU with a persistent CUDA kernel, 4: GPU with CUDA Graphs
-z WARMUP PKTS: wait this amount of packets before starting to measure performance
To run l2fwd-nv in an infinite loop, options -v and -z must be set to 0.
To simulate a heavier workload per packet, the -t parameter can be used to set the number of nanoseconds per packet. This should help you evaluate the best workload approach for your algorithm by combining the processing time per packet (-t) with the number of packets per burst (-b).
In the following benchmarks we report the forwarding throughput: assuming the packet generator is able to transmit packets at the full line rate of 100Gbps, we're interested in the network throughput l2fwd-nv can reach when retransmitting the packets.
In this section we report some performance analysis to highlight different l2fwd-nv configurations. Benchmarks were executed between two different machines connected back-to-back, one with l2fwd-nv and the other with testpmd.
We didn't observe any performance regression upgrading from DPDK 21.11 to DPDK 22.03.
HW features:
- GIGABYTE E251-U70
- CPU Xeon Gold 6240R. 2.4GHz. 24C48T
- NIC ConnectX-6 Dx (MT4125 - MCX623106AE-CDAT)
- NVIDIA GPU V100-PCIE-32GB
- NVIDIA GPU A100-PCIE-40GB
- PCIe bridge between NIC and GPU: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s)
HW topology between NIC and GPU:
-+-[0000:b2]-+-00.0-[b3-b6]----00.0-[b4-b6]--+-08.0-[b5]--+-00.0 Mellanox Technologies MT28841
| | | \-00.1 Mellanox Technologies MT28841
| | \-10.0-[b6]----00.0 NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
SW features:
- Ubuntu 18.04 LTS
- Linux kernel 5.4.0-58-lowlatency
- GCC: 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04)
- Mellanox OFED 5.4-3.1.0.0
- DPDK version: 21.11
- CUDA 11.4
Suggested system configuration, assuming a Mellanox network card with bus id b5:00.0 and network interface enp181s0f0:
mlxconfig -d b5:00.0 set CQE_COMPRESSION=1
mlxfwreset -d b5:00.0 r -y
ifconfig enp181s0f0 mtu 8192 up
ifconfig enp181s0f1 mtu 8192 up
ethtool -A enp181s0f0 rx off tx off
ethtool -A enp181s0f1 rx off tx off
sysctl -w vm.zone_reclaim_mode=0
sysctl -w vm.swappiness=0
PCIe Max Read Request:
$ sudo setpci -s b5:00.0 68.w
2930
$ sudo setpci -s b5:00.0 68.w=5930
$ sudo lspci -s b5:00.0 -vvv | egrep "MaxRead"
MaxPayload 256 bytes, MaxReadReq 4096 bytes
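The register values above can be decoded by hand. The sketch below assumes that offset 68 on this device is the PCI Express Device Control register (an assumption consistent with the lspci output, but the offset differs between devices), where bits 7:5 encode the max payload size and bits 14:12 the max read request size, each as 128 bytes shifted left by the field value. `decode_devctl` is a hypothetical helper name.

```python
def decode_devctl(value: int):
    """Decode (max_payload_bytes, max_read_request_bytes) from a 16-bit
    PCIe Device Control register value."""
    max_payload = 128 << ((value >> 5) & 0x7)     # bits 7:5
    max_read_req = 128 << ((value >> 12) & 0x7)   # bits 14:12
    return max_payload, max_read_req


before = decode_devctl(0x2930)  # value read before the change
after = decode_devctl(0x5930)   # value written by setpci
```

Only the top nibble changes between the two values (2 to 5), which raises the max read request size from 512 to 4096 bytes and leaves the max payload size at 256 bytes.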
In the following performance report, we used the testpmd packet generator that comes with the DPDK 21.11 code. The set of commands used to run and start testpmd is:
cd l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/app
sudo ./dpdk-testpmd -l 2-10 --main-lcore=2 -a b5:00.0 -- --port-numa-config=0,0 --socket-num=0 --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=8 --txq=8 --forward-mode=txonly -i --nb-cores=8 --txonly-multi-flow
testpmd> set txpkts <pkt size>
testpmd> start
In order to measure network throughput, we used the mlnx_perf application that comes with a regular installation of MOFED.
Command line for mlnx_perf is:
mlnx_perf -i enp181s0f1
This tool reads the network card counters to determine the number of sent and received bytes and calculates the data rate:
tx_bytes_phy: 12,371,821,688 Bps = 98,974.57 Mbps
rx_bytes_phy: 12,165,283,124 Bps = 97,322.26 Mbps
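The two units in this output are related by a factor of 8 bits per byte and 10^6 bits per megabit. A minimal sketch of the conversion follows (`bytes_per_sec_to_mbps` is a hypothetical helper, not part of mlnx_perf):

```python
def bytes_per_sec_to_mbps(bytes_per_sec: int) -> float:
    """Convert a byte rate (Bps) into megabits per second (Mbps)."""
    return bytes_per_sec * 8 / 1_000_000


tx_mbps = bytes_per_sec_to_mbps(12_371_821_688)
rx_mbps = bytes_per_sec_to_mbps(12_165_283_124)
print(f"tx: {tx_mbps:,.2f} Mbps, rx: {rx_mbps:,.2f} Mbps")
```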
In this test, GPU memory is used only to receive packets and transmit them back without any workload (I/O forwarding only) in an infinite loop (no performance or warmup max packets). The number of packets received per workload (burst size -b) is fixed to 64 packets.
Assuming a system with a Mellanox network card with bus id b5:00.0 and an NVIDIA GPU with bus id b6:00.0, the command line used is:
sudo GDRCOPY_PATH_L=./external/gdrcopy/src ./build/l2fwdnv -l 0-9 -n 8 -a b5:00.1,txq_inline_max=0 -a b6:00.0 -- -m 1 -w 0 -b 64 -p 4 -v 0 -z 0
Please note that, if libcuda.so is not installed in the default system location, you need to specify the path through the CUDA_PATH_L=/path/to/libcuda.so env var.
Network throughput measured with mlnx_perf:
Packet bytes | Testpmd throughput | CPU memory throughput | GPU V100 memory throughput | GPU A100 memory throughput |
---|---|---|---|---|
64 | 74 Gbps | 18 Gbps | 19 Gbps | 19 Gbps |
128 | 82 Gbps | 36 Gbps | 37 Gbps | 37 Gbps |
256 | 82 Gbps | 68 Gbps | 67 Gbps | 67 Gbps |
512 | 97 Gbps | 97 Gbps | 94 Gbps | 95 Gbps |
1024 | 98 Gbps | 98 Gbps | 94 Gbps | 97 Gbps |
Please note that l2fwd-nv performance relies on the number of packets/sec rather than bytes/sec because the I/O (and the workload) doesn't depend on the length of the packet. In order to keep up with the line rate, in case of smaller packets, the generator has to send more packets/sec than in case of 1kB packets.
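A rough calculation shows how quickly the packet rate grows as packets shrink. The sketch below is an estimate under stated assumptions, not a measurement: it assumes the packet size is the full Ethernet frame and that each frame carries 20 extra bytes on the wire (preamble, start-of-frame delimiter and inter-frame gap). `packets_per_second` and `burst_interval_ns` are hypothetical helper names.

```python
ETH_OVERHEAD_BYTES = 20  # preamble (7) + SFD (1) + inter-frame gap (12)


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Packets/sec needed to saturate a link with frames of a given size."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)


def burst_interval_ns(link_bps: float, frame_bytes: int, burst_size: int) -> float:
    """Time (ns) in which a full burst arrives at line rate."""
    return burst_size / packets_per_second(link_bps, frame_bytes) * 1e9


pps_64 = packets_per_second(100e9, 64)
pps_1024 = packets_per_second(100e9, 1024)
interval_ns = burst_interval_ns(100e9, 1024, 64)
```

Under these assumptions, a 100 Gbps link carries about 148.8 million 64B packets per second but only about 12 million 1kB packets per second, and a burst of 64 packets of 1kB arrives roughly every 5.3 microseconds.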
Here we compare I/O forwarding throughput using different GPU workloads:
- CUDA kernel (-w 2)
- CUDA persistent kernel (-w 3)
- CUDA Graph (-w 4)
Packet size is always 1kB, testpmd send throughput is ~98 Gbps and type of memory is GPU memory (-m 1).
Benchmarks with V100:
Burst size | CUDA kernel throughput | CUDA Persistent kernel throughput | CUDA Graphs throughput |
---|---|---|---|
16 | 18 Gbps | 50 Gbps | 48 Gbps |
32 | 37 Gbps | 88 Gbps | 62 Gbps |
64 | 90 Gbps | 90 Gbps | 90 Gbps |
128 | 90 Gbps | 90 Gbps | 90 Gbps |
Benchmarks with A100:
Burst size | CUDA kernel throughput | CUDA Persistent kernel throughput | CUDA Graphs throughput |
---|---|---|---|
16 | 23 Gbps | 50 Gbps | 30 Gbps |
32 | 49 Gbps | 97 Gbps | 85 Gbps |
64 | 97 Gbps | 97 Gbps | 97 Gbps |
128 | 97 Gbps | 97 Gbps | 97 Gbps |
If the packet generator is sending non-canonical packet sizes (e.g. 1514B), cache alignment problems may slow down performance when GPU memory is used.
To enhance performance you may try adding the rxq_pkt_pad_en=1 device argument to the command line, e.g. -a b5:00.1,txq_inline_max=0,rxq_pkt_pad_en=1.
More info in NVIDIA GTC'21 session S31972 - Accelerate DPDK Packet Processing Using GPU
E. Agostini