Strange performance problems when going below 6 cores per dual port
Civil opened this issue · 0 comments
I'm currently trying to benchmark NICs on a system with a relatively low core count (16 cores, 32 threads), and I'm running into a strange scalability problem when I allocate fewer than 6 cores per dual port.
With 6 cores per dual port on 5 NICs I can easily get about 100 Mpps TX per NIC (and 65-85 Mpps RX, depending on the NIC, which is in line with what I'd expect).
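For reference, the cores-per-dual-port split is what the platform config (the trex_single.yaml passed via --cfg below) controls. A minimal sketch of that layout, with placeholder PCI addresses and thread IDs rather than my real values:

```yaml
- version: 2
  interfaces: ["03:00.0", "03:00.1"]  # placeholder PCI addresses for one dual port
  port_limit: 2
  c: 6                                # data-plane cores per dual port (5 in the second run)
  platform:
    master_thread_id: 0
    latency_thread_id: 1
    dual_if:
      - socket: 0
        threads: [2, 3, 4, 5, 6, 7]   # the 6 cores serving this port pair
```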
This is the performance on ConnectX-6 with 6 cores per dual port:
vtune hotspots:
vtune: Executing actions 75 % Generating a report Elapsed Time: 68.307s
CPU Time: 341.346s
Effective Time: 341.346s
Spin Time: 0s
Overhead Time: 0s
Total Thread Count: 17
Paused Time: 0s
Top Hotspots
Function Module CPU Time % of CPU Time(%)
-------------------------------------------------------------------------------------------------------- ------------- -------- ----------------
rte_rdtsc _t-rex-64 132.300s 38.8%
std::priority_queue<CGenNode*, std::vector<CGenNode*, std::allocator<CGenNode*>>, CGenNodeCompare>::push _t-rex-64 38.642s 11.3%
mlx5_tx_burst_empw_inline libmlx5-64.so 19.090s 5.6%
mlx5_tx_cseg_init libmlx5-64.so 15.152s 4.4%
CNodeGenerator::handle_stl_node _t-rex-64 11.840s 3.5%
[Others] N/A 124.322s 36.4%
Effective Physical Core Utilization: 31.6% (5.049 out of 16)
| The metric value is low, which may signal a poor physical CPU cores
| utilization caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
| or run the Locks and Waits analysis to identify parallel bottlenecks for
| other parallel runtimes.
|
Effective Logical Core Utilization: 15.8% (5.061 out of 32)
| The metric value is low, which may signal a poor logical CPU cores
| utilization. Consider improving physical core utilization as the first
| step and then look at opportunities to utilize logical cores, which in
| some cases can improve processor throughput and overall performance of
| multi-threaded applications.
|
Collection and Platform Info
Application Command Line: ./_t-rex-64 "-i" "-c" "6" "--cfg" "/etc/trex_single.yaml" "--mlx5-so"
Operating System: 6.5.0-0.deb12.4-amd64 12.5
Computer Name: spr-testbench
Result Size: 13.2 MB
Collection start time: 22:11:36 14/03/2024 UTC
Collection stop time: 22:12:46 14/03/2024 UTC
Collector Type: Event-based counting driver,User-mode sampling and tracing
CPU
Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
Frequency: 3.096 GHz
Logical CPU Count: 32
LLC size: 47.2 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available
Some information from vtune performance-snapshot:
vtune: Executing actions 75 % Generating a report Elapsed Time: 45.157s
IPC: 2.652
SP GFLOPS: 0.000
DP GFLOPS: 0.562
Average CPU Frequency: 4.918 GHz
Logical Core Utilization: 16.4% (5.255 out of 32)
Physical Core Utilization: 32.8% (5.241 out of 16)
Microarchitecture Usage: 43.4% of Pipeline Slots
Retiring: 43.4% of Pipeline Slots
Light Operations: 38.7% of Pipeline Slots
Heavy Operations: 4.7% of Pipeline Slots
Front-End Bound: 2.2% of Pipeline Slots
Front-End Latency: 0.6% of Pipeline Slots
Front-End Bandwidth: 1.6% of Pipeline Slots
Bad Speculation: 1.2% of Pipeline Slots
Branch Mispredict: 0.8% of Pipeline Slots
Machine Clears: 0.4% of Pipeline Slots
Back-End Bound: 53.2% of Pipeline Slots
Memory Bound: 14.2% of Pipeline Slots
L1 Bound: 2.6% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 1.0% of Clockticks
L3 Latency: 0.2% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.1% of Clockticks
Memory Latency: 9.2% of Clockticks
Local DRAM: 0.0% of Clockticks
Remote DRAM: 0.0% of Clockticks
Remote Cache: 0.0% of Clockticks
Core Bound: 39.0% of Pipeline Slots
Memory Bound: 14.2% of Pipeline Slots
Cache Bound: 3.7% of Clockticks
DRAM Bound: 0.0% of Clockticks
DRAM Bandwidth Bound: 0.0% of Elapsed Time
NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
Instruction Mix
HP FLOPs: 0.0% of uOps
Packed: 0.0%
128-bit: 0.0%
256-bit: 0.0%
512-bit: 0.0%
Scalar: 0.0%
SP FLOPs: 0.0% of uOps
Packed: 11.4% from SP FP
128-bit: 11.0% from SP FP
256-bit: 0.4% from SP FP
512-bit: 0.0% from SP FP
Scalar: 88.6% from SP FP
DP FLOPs: 0.8% of uOps
Packed: 0.0% from DP FP
128-bit: 0.0% from DP FP
256-bit: 0.0% from DP FP
512-bit: 0.0% from DP FP
Scalar: 100.0% from DP FP
AMX BF16 FLOPs: 0.0% of uOps
x87 FLOPs: 0.0% of uOps
Non-FP: 99.2% of uOps
FP Arith/Mem Rd Instr. Ratio: 0.031
FP Arith/Mem Wr Instr. Ratio: 0.069
PCIe Bandwidth: 13.345 GB/s
PCI Device Class PCIe Bandwidth, GB/s
------------------ --------------------
Network controller 13.345
Bridge 0.000
[Unknown] 0.000
And here is the same test with 5 cores per dual port:
Here the drop in TX from 100 Mpps per port to 25 Mpps is clearly visible (a quarter of the original, for a difference of only one core), and performance is WAY less stable (it can briefly go up to 40 Mpps but then drops back).
From vtune hotspots:
vtune: Executing actions 75 % Generating a report Elapsed Time: 82.541s
CPU Time: 381.038s
Effective Time: 381.038s
Spin Time: 0s
Overhead Time: 0s
Total Thread Count: 16
Paused Time: 0s
Top Hotspots
Function Module CPU Time % of CPU Time(%)
-------------------------------------------------------------------------------------------------------- ------------- -------- ----------------
rte_rdtsc _t-rex-64 295.096s 77.4%
std::priority_queue<CGenNode*, std::vector<CGenNode*, std::allocator<CGenNode*>>, CGenNodeCompare>::push _t-rex-64 16.046s 4.2%
rte_delay_us_block _t-rex-64 6.892s 1.8%
rte_pause _t-rex-64 5.934s 1.6%
check_cqe libmlx5-64.so 4.816s 1.3%
[Others] N/A 52.254s 13.7%
Effective Physical Core Utilization: 29.1% (4.657 out of 16)
| The metric value is low, which may signal a poor physical CPU cores
| utilization caused by:
| - load imbalance
| - threading runtime overhead
| - contended synchronization
| - thread/process underutilization
| - incorrect affinity that utilizes logical cores instead of physical
| cores
| Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
| or run the Locks and Waits analysis to identify parallel bottlenecks for
| other parallel runtimes.
|
Effective Logical Core Utilization: 14.6% (4.668 out of 32)
| The metric value is low, which may signal a poor logical CPU cores
| utilization. Consider improving physical core utilization as the first
| step and then look at opportunities to utilize logical cores, which in
| some cases can improve processor throughput and overall performance of
| multi-threaded applications.
|
Collection and Platform Info
Application Command Line: ./_t-rex-64 "-i" "-c" "5" "--cfg" "/etc/trex_single.yaml" "--mlx5-so"
Operating System: 6.5.0-0.deb12.4-amd64 12.5
Computer Name: spr-testbench
Result Size: 13.7 MB
Collection start time: 22:16:15 14/03/2024 UTC
Collection stop time: 22:17:39 14/03/2024 UTC
Collector Type: Event-based counting driver,User-mode sampling and tracing
CPU
Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
Frequency: 3.096 GHz
Logical CPU Count: 32
LLC size: 47.2 MB
Cache Allocation Technology
Level 2 capability: available
Level 3 capability: available
The composition is completely different, and most of the time is spent just reading the clock (rte_rdtsc).
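For context, on x86 DPDK's rte_rdtsc() is essentially a single rdtsc instruction wrapped in an inline function, so a profile dominated by it usually means the core is spinning on the TSC rather than doing real work. A minimal sketch of the pattern that would produce this picture, assuming the node scheduler paces itself by polling the TSC (the names below are illustrative, not TRex's actual code):

```c
#include <stdint.h>

/* Roughly what DPDK's rte_rdtsc() does on x86: read the TSC. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Illustrative pacing loop (hypothetical, not TRex's code): spin until the
 * next scheduled node's timestamp is reached. If the scheduler spends most
 * of its time here, nearly all samples land in the TSC read itself, which
 * is exactly what the 5-core hotspot profile shows. */
static void wait_until_tsc(uint64_t next_node_tsc)
{
    while (read_tsc() < next_node_tsc)
        ;   /* each iteration is another rdtsc */
}
```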
And the vtune performance snapshot:
vtune: Executing actions 75 % Generating a report Elapsed Time: 59.648s
IPC: 0.894
SP GFLOPS: 0.000
DP GFLOPS: 0.163
Average CPU Frequency: 4.921 GHz
Logical Core Utilization: 13.4% (4.277 out of 32)
Physical Core Utilization: 26.6% (4.263 out of 16)
Microarchitecture Usage: 19.9% of Pipeline Slots
Retiring: 19.9% of Pipeline Slots
Light Operations: 13.0% of Pipeline Slots
Heavy Operations: 6.8% of Pipeline Slots
Front-End Bound: 2.2% of Pipeline Slots
Front-End Latency: 0.6% of Pipeline Slots
Front-End Bandwidth: 1.6% of Pipeline Slots
Bad Speculation: 0.5% of Pipeline Slots
Branch Mispredict: 0.4% of Pipeline Slots
Machine Clears: 0.0% of Pipeline Slots
Back-End Bound: 77.4% of Pipeline Slots
Memory Bound: 5.9% of Pipeline Slots
L1 Bound: 1.2% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 1.1% of Clockticks
L3 Latency: 1.6% of Clockticks
DRAM Bound: 0.0% of Clockticks
Memory Bandwidth: 0.1% of Clockticks
Memory Latency: 3.5% of Clockticks
Local DRAM: 0.0% of Clockticks
Remote DRAM: 0.0% of Clockticks
Remote Cache: 0.0% of Clockticks
Core Bound: 71.5% of Pipeline Slots
Memory Bound: 5.9% of Pipeline Slots
Cache Bound: 2.3% of Clockticks
DRAM Bound: 0.0% of Clockticks
DRAM Bandwidth Bound: 0.0% of Elapsed Time
NUMA: % of Remote Accesses: 0.0%
Vectorization: 0.0% of Packed FP Operations
Instruction Mix
HP FLOPs: 0.0% of uOps
Packed: 0.0%
128-bit: 0.0%
256-bit: 0.0%
512-bit: 0.0%
Scalar: 0.0%
SP FLOPs: 0.0% of uOps
Packed: 24.7% from SP FP
128-bit: 24.7% from SP FP
256-bit: 0.0% from SP FP
512-bit: 0.0% from SP FP
Scalar: 75.3% from SP FP
DP FLOPs: 0.7% of uOps
Packed: 0.0% from DP FP
128-bit: 0.0% from DP FP
256-bit: 0.0% from DP FP
512-bit: 0.0% from DP FP
Scalar: 100.0% from DP FP
AMX BF16 FLOPs: 0.0% of uOps
x87 FLOPs: 0.0% of uOps
Non-FP: 99.3% of uOps
FP Arith/Mem Rd Instr. Ratio: 0.031
FP Arith/Mem Wr Instr. Ratio: 0.072
PCIe Bandwidth: 4.280 GB/s
PCI Device Class PCIe Bandwidth, GB/s
------------------ --------------------
Network controller 4.280
Bridge 0.000
[Unknown] 0.000
That looks like a COMPLETELY different workload, one that should be far less stressful for the cores (IPC < 1 compared to 2.6 before), so it looks like some problem with the scheduling logic within TRex, or possibly a problem within DPDK.