cyyself/wg-bench

Strange Performance on Raspberry Pi 4

cyyself opened this issue · 7 comments

Based on the results, even the 2 GHz quad-core A53 in the TP-Link XDR 6088 achieves 818 Mbits/sec. I doubted that the Raspberry Pi 4's result of only 394 Mbits/sec was accurate, since it has a quad-core A72 @ 1.5 GHz. So I switched back to the archlinuxarm-based PiKVM distro that my Raspberry Pi 4 usually runs, which uses an armv7l kernel rather than the aarch64 kernel of Raspberry Pi OS, and ran the benchmark again. The result astonished me.

| Device / CPU                   | OS / Kernel / iperf Param  | Speed          |
|--------------------------------|----------------------------|----------------|
| Raspberry Pi 4 / BCM2711*      | Debian bookworm / 6.1.63   | 394 Mbits/sec  |
| Raspberry Pi 4 / BCM2711*      | archlinux / 6.1.61(armv7l) | 665 Mbits/sec  |

With the armv7l kernel, throughput is about 69% higher. Why?
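For reference, the 69% figure follows directly from the two table rows. The sketch below only redoes that arithmetic; the dictionary labels and the `speedup_percent` helper are illustrative names, not part of wg-bench.

```python
# Throughput figures copied from the table above, in Mbits/sec.
results = {
    "Debian bookworm / 6.1.63 (aarch64)": 394,
    "archlinux / 6.1.61 (armv7l)": 665,
}


def speedup_percent(baseline, candidate):
    """Return how much faster `candidate` is than `baseline`, in percent."""
    return (candidate / baseline - 1.0) * 100.0


baseline = results["Debian bookworm / 6.1.63 (aarch64)"]
for name, mbits in results.items():
    gain = speedup_percent(baseline, mbits)
    print(f"{name}: {mbits} Mbits/sec ({gain:+.1f}% vs. aarch64 baseline)")
```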

I searched the web and found a thread describing the same confusion, but about AES rather than the ChaCha20 used by WireGuard [1]. It may be that the kernel's ChaCha20 implementation is not well optimized on aarch64. I am leaving this issue here to record any further investigation of the performance problem.

[1] https://forums.raspberrypi.com/viewtopic.php?t=317075

Update: After flashing my RPi 4 directly with OpenWrt 23.05.2, which ships Linux v5.15.137 compiled by OpenWrt, I got 1.01 Gbits/sec!

| Raspberry Pi 4 / BCM2711*      | OpenWrt 23.05.2 / 5.15.137 | 1.01 Gbits/sec |

One interesting finding: using CONFIG_PREEMPT_NONE instead of CONFIG_PREEMPT in the kernel config reaches ~670 Mbps on the 6.1.y kernel. CONFIG_PREEMPT_NONE is set by default in the OpenWrt kernel.

Connecting to host 169.254.200.2, port 5201
[  5] local 169.254.200.1 port 47296 connected to 169.254.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  78.2 MBytes   656 Mbits/sec    0    402 KBytes       
[  5]   1.00-2.00   sec  80.2 MBytes   672 Mbits/sec    0    441 KBytes       
[  5]   2.00-3.00   sec  79.6 MBytes   668 Mbits/sec    0    441 KBytes       
[  5]   3.00-4.00   sec  80.3 MBytes   674 Mbits/sec    0    441 KBytes       
[  5]   4.00-5.00   sec  80.8 MBytes   678 Mbits/sec    0    441 KBytes       
[  5]   5.00-6.00   sec  81.0 MBytes   679 Mbits/sec    0    441 KBytes       
[  5]   6.00-7.00   sec  79.5 MBytes   667 Mbits/sec    0    441 KBytes       
[  5]   7.00-8.00   sec  80.1 MBytes   672 Mbits/sec    0    441 KBytes       
[  5]   8.00-9.00   sec  80.1 MBytes   672 Mbits/sec    0    441 KBytes       
[  5]   9.00-10.00  sec  79.7 MBytes   668 Mbits/sec    0    441 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   799 MBytes   671 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   798 MBytes   669 Mbits/sec                  receiver

iperf Done.
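When comparing many kernel builds, it can help to pull the summary bitrate out of each run automatically. The following is a minimal sketch, assuming the plain-text iperf3 output format shown above; `summary_mbits` and the regular expression are hypothetical helpers written for this thread, not part of iperf3 or wg-bench.

```python
import re

# Matches iperf3 summary lines such as:
# [  5]   0.00-10.00  sec   799 MBytes   671 Mbits/sec    0             sender
SUMMARY = re.compile(
    r"^\[\s*\d+\]\s+[\d.]+-[\d.]+\s+sec\s+[\d.]+\s+[KMG]?Bytes\s+"
    r"(?P<rate>[\d.]+)\s+(?P<unit>[KMG]?)bits/sec.*\b(?P<role>sender|receiver)\s*$"
)

# Multipliers that convert each unit prefix to Mbits/sec.
TO_MBITS = {"": 1e-6, "K": 1e-3, "M": 1.0, "G": 1e3}


def summary_mbits(iperf_text):
    """Return {'sender': ..., 'receiver': ...} in Mbits/sec from iperf3 text output."""
    rates = {}
    for line in iperf_text.splitlines():
        m = SUMMARY.match(line)
        if m:
            rates[m.group("role")] = float(m.group("rate")) * TO_MBITS[m.group("unit")]
    return rates


# The two summary lines from the CONFIG_PREEMPT_NONE run above.
sample = """\
[  5]   0.00-10.00  sec   799 MBytes   671 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   798 MBytes   669 Mbits/sec                  receiver
"""
print(summary_mbits(sample))  # {'sender': 671.0, 'receiver': 669.0}
```

The per-second interval lines are skipped because they do not end in `sender` or `receiver`.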

Another interesting finding: turning off CONFIG_FTRACE together with using CONFIG_PREEMPT_NONE reaches ~1.1 Gbps with bcm2711_defconfig on rpi-6.1.y.

Connecting to host 169.254.200.2, port 5201
[  5] local 169.254.200.1 port 37182 connected to 169.254.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   135 MBytes  1.13 Gbits/sec    0    818 KBytes       
[  5]   1.00-2.00   sec   130 MBytes  1.09 Gbits/sec    0    860 KBytes       
[  5]   2.00-3.00   sec   126 MBytes  1.05 Gbits/sec    0    975 KBytes       
[  5]   3.00-4.00   sec   130 MBytes  1.09 Gbits/sec    0   1022 KBytes       
[  5]   4.00-5.00   sec   130 MBytes  1.09 Gbits/sec    0   1.07 MBytes       
[  5]   5.00-6.00   sec   132 MBytes  1.11 Gbits/sec    0   1.07 MBytes       
[  5]   6.00-7.00   sec   132 MBytes  1.11 Gbits/sec    0   1.14 MBytes       
[  5]   7.00-8.00   sec   132 MBytes  1.11 Gbits/sec    0   1.26 MBytes       
[  5]   8.00-9.00   sec   129 MBytes  1.08 Gbits/sec    0   1.26 MBytes       
[  5]   9.00-10.01  sec   130 MBytes  1.08 Gbits/sec    0   1.48 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.01  sec  1.28 GBytes  1.09 Gbits/sec    0             sender
[  5]   0.00-10.01  sec  1.27 GBytes  1.09 Gbits/sec                  receiver

iperf Done.

However, turning off CONFIG_FTRACE also turns off a series of options that depend on it, as the config diff below shows. Further investigation is therefore needed to find which option actually hinders performance.

104d103
< # CONFIG_BPF_LSM is not set
139d137
< CONFIG_TASKS_RUDE_RCU=y
260d257
< CONFIG_TRACEPOINTS=y
1603d1599
< # CONFIG_BATMAN_ADV_TRACING is not set
1637d1632
< # CONFIG_NET_DROP_MONITOR is not set
2965d2959
< # CONFIG_ATH6KL_TRACING is not set
8041d8034
< # CONFIG_PSTORE_FTRACE is not set
8492d8484
< # CONFIG_TRACE_MMIO_ACCESS is not set
8726d8717
< # CONFIG_DEBUG_PAGE_REF is not set
8803,8804d8793
< CONFIG_TRACE_IRQFLAGS=y
< CONFIG_TRACE_IRQFLAGS_NMI=y
8837d8825
< CONFIG_NOP_TRACER=y
8845d8832
< CONFIG_TRACER_MAX_TRACE=y
8847,8853d8833
< CONFIG_RING_BUFFER=y
< CONFIG_EVENT_TRACING=y
< CONFIG_CONTEXT_SWITCH_TRACER=y
< CONFIG_RING_BUFFER_ALLOW_SWAP=y
< CONFIG_PREEMPTIRQ_TRACEPOINTS=y
< CONFIG_TRACING=y
< CONFIG_GENERIC_TRACER=y
8855,8895c8835
< CONFIG_FTRACE=y
< # CONFIG_BOOTTIME_TRACING is not set
< CONFIG_FUNCTION_TRACER=y
< CONFIG_FUNCTION_GRAPH_TRACER=y
< CONFIG_DYNAMIC_FTRACE=y
< CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
< CONFIG_FUNCTION_PROFILER=y
< CONFIG_STACK_TRACER=y
< CONFIG_IRQSOFF_TRACER=y
< CONFIG_SCHED_TRACER=y
< # CONFIG_HWLAT_TRACER is not set
< # CONFIG_OSNOISE_TRACER is not set
< # CONFIG_TIMERLAT_TRACER is not set
< # CONFIG_FTRACE_SYSCALLS is not set
< CONFIG_TRACER_SNAPSHOT=y
< CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP=y
< CONFIG_BRANCH_PROFILE_NONE=y
< # CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
< # CONFIG_PROFILE_ALL_BRANCHES is not set
< CONFIG_BLK_DEV_IO_TRACE=y
< CONFIG_KPROBE_EVENTS=y
< # CONFIG_KPROBE_EVENTS_ON_NOTRACE is not set
< # CONFIG_UPROBE_EVENTS is not set
< CONFIG_BPF_EVENTS=y
< CONFIG_DYNAMIC_EVENTS=y
< CONFIG_PROBE_EVENTS=y
< CONFIG_FTRACE_MCOUNT_RECORD=y
< CONFIG_FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY=y
< # CONFIG_SYNTH_EVENTS is not set
< # CONFIG_HIST_TRIGGERS is not set
< # CONFIG_TRACE_EVENT_INJECT is not set
< # CONFIG_TRACEPOINT_BENCHMARK is not set
< # CONFIG_RING_BUFFER_BENCHMARK is not set
< # CONFIG_TRACE_EVAL_MAP_FILE is not set
< # CONFIG_FTRACE_RECORD_RECURSION is not set
< # CONFIG_FTRACE_STARTUP_TEST is not set
< # CONFIG_RING_BUFFER_STARTUP_TEST is not set
< # CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is not set
< # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
< # CONFIG_KPROBE_EVENT_GEN_TEST is not set
< # CONFIG_RV is not set
---
> # CONFIG_FTRACE is not set
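Many lines in the diff above are `# ... is not set` entries that merely disappear when their parent menu is hidden, so they are not real behavior changes. One way to narrow the candidates is to compare only effective values, treating "not set" and "absent" as the same. This is a rough sketch under that assumption; `parse_config` and `effective_changes` are hypothetical helpers, and the two sample configs are short excerpts built from option names in the diff, not full kernel configs.

```python
import re

# Matches lines such as CONFIG_FTRACE=y or CONFIG_HZ=250.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")


def parse_config(text):
    """Map each enabled option to its value; unset or absent options are omitted."""
    enabled = {}
    for line in text.splitlines():
        m = _SET.match(line.strip())
        if m and m.group(2) != "n":
            enabled[m.group(1)] = m.group(2)
    return enabled


def effective_changes(old_text, new_text):
    """Return {option: (old_value, new_value)} for options whose value differs."""
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name, "n"), new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes


old_cfg = """\
# CONFIG_BPF_LSM is not set
CONFIG_TRACEPOINTS=y
CONFIG_FTRACE=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_PREEMPT_NONE=y
"""
new_cfg = """\
# CONFIG_FTRACE is not set
CONFIG_PREEMPT_NONE=y
"""
for name, (before, after) in effective_changes(old_cfg, new_cfg).items():
    print(f"{name}: {before} -> {after}")
```

On these excerpts, CONFIG_BPF_LSM drops out of the list because it was disabled in both configs. The kernel source tree's `scripts/diffconfig` serves a similar purpose on full config files.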

Yet another interesting finding: turning off CONFIG_IRQSOFF_TRACER, along with using CONFIG_PREEMPT_NONE, also reaches ~1.1 Gbps.

Turning on CONFIG_IRQSOFF_TRACER also affects the following options:

8803a8804,8805
> CONFIG_TRACE_IRQFLAGS=y
> CONFIG_TRACE_IRQFLAGS_NMI=y
8849a8852
> CONFIG_PREEMPTIRQ_TRACEPOINTS=y
8861c8864
< # CONFIG_IRQSOFF_TRACER is not set
---
> CONFIG_IRQSOFF_TRACER=y
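For anyone reporting results from another distro, it may be useful to state how these options are set on the kernel being tested. The sketch below assumes the kernel exposes its config at `/proc/config.gz` (which requires CONFIG_IKCONFIG_PROC) and falls back to a message otherwise; `option_status`, `running_kernel_config`, and the `SUSPECTS` list are hypothetical names chosen for this example.

```python
import gzip
import os

# Options named in this thread as affecting WireGuard throughput.
SUSPECTS = [
    "CONFIG_PREEMPT_NONE",
    "CONFIG_PREEMPT",
    "CONFIG_FTRACE",
    "CONFIG_IRQSOFF_TRACER",
    "CONFIG_TRACE_IRQFLAGS",
    "CONFIG_PREEMPTIRQ_TRACEPOINTS",
]


def option_status(config_text, names=SUSPECTS):
    """Return {option: value}, using 'n' for options that are unset or absent."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            values[key] = value
    return {name: values.get(name, "n") for name in names}


def running_kernel_config(path="/proc/config.gz"):
    """Read the running kernel's config, or return None if it is not readable."""
    if not os.path.exists(path):
        return None
    try:
        with gzip.open(path, "rt") as f:
            return f.read()
    except OSError:
        return None


text = running_kernel_config()
if text is None:
    print("kernel config is not exposed at /proc/config.gz")
else:
    for name, value in option_status(text).items():
        print(f"{name}={value}")
```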

On my RPi 4B running OpenWrt 23.05.2 (64-bit), the measured result was 881 Mbps.

BTW, I believe 32-bit vs. 64-bit should make some difference. Should we indicate this in the results?

For an out-of-order CPU, it is normal for 32-bit and 64-bit to show the same performance. Sometimes 64-bit may even be slower, because wider pointers consume more cache capacity. The intuition that 64-bit is faster comes from the doubled register width, which lets a 64-bit arithmetic operation finish in a single instruction. But keep in mind that 64-bit operations also have longer latency in the physical circuit, which may require a lower clock frequency or more cycles per operation. The same applies to SIMD.

The crypto algorithms in WireGuard, ChaCha20 and Poly1305, also use SIMD (i.e., Arm NEON) for their calculations. If the microarchitecture does not provide wide enough SIMD processing in a single cycle, we will get the same performance whether the kernel is 32-bit or 64-bit.
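As background that is not stated in the thread itself: ChaCha20 is specified in terms of 32-bit words (add, XOR, rotate), so wider general-purpose registers do not by themselves reduce the number of operations. The sketch below is a plain Python rendering of the ChaCha quarter round from RFC 8439, checked against the test vector in section 2.1.1 of that RFC. It only illustrates the word size and is not how the kernel's NEON code is written.

```python
MASK32 = 0xFFFFFFFF  # every ChaCha20 state word is 32 bits wide


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(a, b, c, d):
    """ChaCha quarter round: only 32-bit add, XOR, and rotate."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


# Input words from the RFC 8439 section 2.1.1 test vector.
out = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
print([f"{w:08x}" for w in out])
```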