Strange performance problems when going below 6 cores per dual port
Civil opened this issue · 0 comments
I'm currently trying to benchmark NICs on a system with a relatively low core count (16 cores, 32 threads), and I'm running into a strange scalability problem when I allocate fewer than 6 cores per dual port.
With 6 cores per dual port on 5 NICs I can easily get about 100 Mpps TX per NIC (and 65-85 Mpps RX, depending on the NIC, which is in line with what I'd expect).
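For reference, the cores-per-dual-port split is what the platform config (the trex_single.yaml passed via --cfg below) controls. A minimal sketch of that layout, with placeholder PCI addresses and thread IDs rather than my real values:

```yaml
- version: 2
  interfaces: ["03:00.0", "03:00.1"]  # placeholder PCI addresses for one dual port
  port_limit: 2
  c: 6                                # data-plane cores per dual port (5 in the second run)
  platform:
    master_thread_id: 0
    latency_thread_id: 1
    dual_if:
      - socket: 0
        threads: [2, 3, 4, 5, 6, 7]   # the 6 cores serving this port pair
```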
This is the performance on ConnectX-6 with 6 cores per dual port:
vtune hotspots:
vtune: Executing actions 75 % Generating a report Elapsed Time: 68.307s
CPU Time: 341.346s
Effective Time: 341.346s
Spin Time: 0s
Overhead Time: 0s
Total Thread Count: 17
Paused Time: 0s
Top Hotspots
Function Module CPU Time % of CPU Time(%)
-------------------------------------------------------------------------------------------------------- ------------- -------- ----------------
rte_rdtsc _t-rex-64 132.300s 38.8%
std::priority_queue<CGenNode*, std::vector<CGenNode*, std::allocator<CGenNode*>>, CGenNodeCompare>::push _t-rex-64 38.642s 11.3%
mlx5_tx_burst_empw_inline libmlx5-64.so 19.090s 5.6%
mlx5_tx_cseg_init libmlx5-64.so 15.152s 4.4%
CNodeGenerator::handle_stl_node _t-rex-64 11.840s 3.5%
[Others] N/A 124.322s 36.4%
Effective Physical Core Utilization: 31.6% (5.049 out of 16)
| The metric value is low, which may signal a poor physical CPU cores
| utilization caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
| or run the Locks and Waits analysis to identify parallel bottlenecks for
| other parallel runtimes.
|
Effective Logical Core Utilization: 15.8% (5.061 out of 32)
| The metric value is low, which may signal a poor logical CPU cores
| utilization. Consider improving physical core utilization as the first
| step and then look at opportunities to utilize logical cores, which in
| some cases can improve processor throughput and overall performance of
| multi-threaded applications.
|
Collection and Platform Info
Application Command Line: ./_t-rex-64 "-i" "-c" "6" "--cfg" "/etc/trex_single.yaml" "--mlx5-so"
Operating System: 6.5.0-0.deb12.4-amd64 12.5
Computer Name: spr-testbench
Result Size: 13.2 MB
Collection start time: 22:11:36 14/03/2024 UTC
Collection stop time: 22:12:46 14/03/2024 UTC
Collector Type: Event-based counting driver,User-mode sampling and tracing
CPU
Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
Frequency: 3.096 GHz
Logical CPU Count: 32
LLC size: 47.2 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available
Some information from vtune performance-snapshot:
vtune: Executing actions 75 % Generating a report Elapsed Time: 45.157s
IPC: 2.652
SP GFLOPS: 0.000
DP GFLOPS: 0.562
Average CPU Frequency: 4.918 GHz
Logical Core Utilization: 16.4% (5.255 out of 32)
Physical Core Utilization: 32.8% (5.241 out of 16)
Microarchitecture Usage: 43.4% of Pipeline Slots
Retiring: 43.4% of Pipeline Slots
Light Operations: 38.7% of Pipeline Slots
Heavy Operations: 4.7% of Pipeline Slots
Front-End Bound: 2.2% of Pipeline Slots
Front-End Latency: 0.6% of Pipeline Slots
Front-End Bandwidth: 1.6% of Pipeline Slots
Bad Speculation: 1.2% of Pipeline Slots
Branch Mispredict: 0.8% of Pipeline Slots
Machine Clears: 0.4% of Pipeline Slots
Back-End Bound: 53.2% of Pipeline Slots
Memory Bound: 14.2% of Pipeline Slots
L1 Bound: 2.6% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 1.0% of Clockticks
L3 Latency: 0.2% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.1% of Clockticks
Memory Latency: 9.2% of Clockticks
Local DRAM: 0.0% of Clockticks
Remote DRAM: 0.0% of Clockticks
Remote Cache: 0.0% of Clockticks
Core Bound: 39.0% of Pipeline Slots
Memory Bound: 14.2% of Pipeline Slots
Cache Bound: 3.7% of Clockticks
DRAM Bound: 0.0% of Clockticks
DRAM Bandwidth Bound: 0.0% of Elapsed Time
NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
Instruction Mix
HP FLOPs: 0.0% of uOps
Packed: 0.0%
128-bit: 0.0%
256-bit: 0.0%
512-bit: 0.0%
Scalar: 0.0%
SP FLOPs: 0.0% of uOps
Packed: 11.4% from SP FP
128-bit: 11.0% from SP FP
256-bit: 0.4% from SP FP
512-bit: 0.0% from SP FP
Scalar: 88.6% from SP FP
DP FLOPs: 0.8% of uOps
Packed: 0.0% from DP FP
128-bit: 0.0% from DP FP
256-bit: 0.0% from DP FP
512-bit: 0.0% from DP FP
Scalar: 100.0% from DP FP
AMX BF16 FLOPs: 0.0% of uOps
x87 FLOPs: 0.0% of uOps
Non-FP: 99.2% of uOps
FP Arith/Mem Rd Instr. Ratio: 0.031
FP Arith/Mem Wr Instr. Ratio: 0.069
PCIe Bandwidth: 13.345 GB/s
PCI Device Class PCIe Bandwidth, GB/s
------------------ --------------------
Network controller 13.345
Bridge 0.000
[Unknown] 0.000
And here is the same test with 5 cores per dual port:
Here the drop in TX from 100 Mpps per port to 25 Mpps is clearly visible (a quarter of the original, for a difference of only one core), and performance is WAY less stable (it can briefly go up to 40 Mpps but then drops back).
From vtune hotspots:
vtune: Executing actions 75 % Generating a report Elapsed Time: 82.541s
CPU Time: 381.038s
Effective Time: 381.038s
Spin Time: 0s
Overhead Time: 0s
Total Thread Count: 16
Paused Time: 0s
Top Hotspots
Function Module CPU Time % of CPU Time(%)
-------------------------------------------------------------------------------------------------------- ------------- -------- ----------------
rte_rdtsc _t-rex-64 295.096s 77.4%
std::priority_queue<CGenNode*, std::vector<CGenNode*, std::allocator<CGenNode*>>, CGenNodeCompare>::push _t-rex-64 16.046s 4.2%
rte_delay_us_block _t-rex-64 6.892s 1.8%
rte_pause _t-rex-64 5.934s 1.6%
check_cqe libmlx5-64.so 4.816s 1.3%
[Others] N/A 52.254s 13.7%
Effective Physical Core Utilization: 29.1% (4.657 out of 16)
| The metric value is low, which may signal a poor physical CPU cores
| utilization caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
| or run the Locks and Waits analysis to identify parallel bottlenecks for
| other parallel runtimes.
|
Effective Logical Core Utilization: 14.6% (4.668 out of 32)
| The metric value is low, which may signal a poor logical CPU cores
| utilization. Consider improving physical core utilization as the first
| step and then look at opportunities to utilize logical cores, which in
| some cases can improve processor throughput and overall performance of
| multi-threaded applications.
|
Collection and Platform Info
Application Command Line: ./_t-rex-64 "-i" "-c" "5" "--cfg" "/etc/trex_single.yaml" "--mlx5-so"
Operating System: 6.5.0-0.deb12.4-amd64 12.5
Computer Name: spr-testbench
Result Size: 13.7 MB
Collection start time: 22:16:15 14/03/2024 UTC
Collection stop time: 22:17:39 14/03/2024 UTC
Collector Type: Event-based counting driver,User-mode sampling and tracing
CPU
Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
Frequency: 3.096 GHz
Logical CPU Count: 32
LLC size: 47.2 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available
The composition is completely different, and most of the time is spent just reading the clock (rte_rdtsc).
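For context, on x86 DPDK's rte_rdtsc() is essentially a single rdtsc instruction wrapped in an inline function, so a profile dominated by it usually means the core is spinning on the TSC rather than doing real work. A minimal sketch of the pattern that would produce this picture, assuming the node scheduler paces itself by polling the TSC (the names below are illustrative, not TRex's actual code):

```c
#include <stdint.h>

/* Roughly what DPDK's rte_rdtsc() does on x86: read the TSC. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Illustrative pacing loop (hypothetical, not TRex's code): spin until the
 * next scheduled node's timestamp is reached. If the scheduler spends most
 * of its time here, nearly all samples land in the TSC read itself, which
 * is exactly what the 5-core hotspot profile shows. */
static void wait_until_tsc(uint64_t next_node_tsc)
{
    while (read_tsc() < next_node_tsc)
        ;   /* each iteration is another rdtsc */
}
```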
And the vtune performance snapshot:
vtune: Executing actions 75 % Generating a report Elapsed Time: 59.648s
IPC: 0.894
SP GFLOPS: 0.000
DP GFLOPS: 0.163
Average CPU Frequency: 4.921 GHz
Logical Core Utilization: 13.4% (4.277 out of 32)
Physical Core Utilization: 26.6% (4.263 out of 16)
Microarchitecture Usage: 19.9% of Pipeline Slots
Retiring: 19.9% of Pipeline Slots
Light Operations: 13.0% of Pipeline Slots
Heavy Operations: 6.8% of Pipeline Slots
Front-End Bound: 2.2% of Pipeline Slots
Front-End Latency: 0.6% of Pipeline Slots
Front-End Bandwidth: 1.6% of Pipeline Slots
Bad Speculation: 0.5% of Pipeline Slots
Branch Mispredict: 0.4% of Pipeline Slots
Machine Clears: 0.0% of Pipeline Slots
Back-End Bound: 77.4% of Pipeline Slots
Memory Bound: 5.9% of Pipeline Slots
L1 Bound: 1.2% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 1.1% of Clockticks
L3 Latency: 1.6% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.1% of Clockticks
Memory Latency: 3.5% of Clockticks
Local DRAM: 0.0% of Clockticks
Remote DRAM: 0.0% of Clockticks
Remote Cache: 0.0% of Clockticks
Core Bound: 71.5% of Pipeline Slots
Memory Bound: 5.9% of Pipeline Slots
Cache Bound: 2.3% of Clockticks
DRAM Bound: 0.0% of Clockticks
DRAM Bandwidth Bound: 0.0% of Elapsed Time
NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
Instruction Mix
HP FLOPs: 0.0% of uOps
Packed: 0.0%
128-bit: 0.0%
256-bit: 0.0%
512-bit: 0.0%
Scalar: 0.0%
SP FLOPs: 0.0% of uOps
Packed: 24.7% from SP FP
128-bit: 24.7% from SP FP
256-bit: 0.0% from SP FP
512-bit: 0.0% from SP FP
Scalar: 75.3% from SP FP
DP FLOPs: 0.7% of uOps
Packed: 0.0% from DP FP
128-bit: 0.0% from DP FP
256-bit: 0.0% from DP FP
512-bit: 0.0% from DP FP
Scalar: 100.0% from DP FP
AMX BF16 FLOPs: 0.0% of uOps
x87 FLOPs: 0.0% of uOps
Non-FP: 99.3% of uOps
FP Arith/Mem Rd Instr. Ratio: 0.031
FP Arith/Mem Wr Instr. Ratio: 0.072
PCIe Bandwidth: 4.280 GB/s
PCI Device Class PCIe Bandwidth, GB/s
------------------ --------------------
Network controller 4.280
Bridge 0.000
[Unknown] 0.000
That looks like a COMPLETELY different workload, one that should be far less stressful for the cores (IPC < 1 compared to 2.6 before), so it looks like some problem with the scheduling logic within TRex, or possibly a problem within DPDK.